Let $\mathcal F$ be a class of measurable functions on a measurable space
$(S,\mathcal S)$ with values in $[0,1]$ and let

$$P_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$$

be the empirical measure based on an i.i.d. sample $(X_1,\dots,X_n)$ from
a probability distribution $P$ on $(S,\mathcal S)$. We study the behavior of
suprema of the following type:

$$\sup_{r_n<\sigma_P f\le\delta_n}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)},$$

where $\sigma_Pf\ge\mathrm{Var}_P^{1/2}f$ and $\phi$ is a continuous, strictly increasing
function with $\phi(0)=0$. Using Talagrand's concentration inequality for
empirical processes, we establish concentration inequalities for such
suprema and use them to derive several results about their asymptotic
behavior, expressing the conditions in terms of expectations of
localized suprema of empirical processes. We also prove new bounds
for expected values of sup-norms of empirical processes in terms of
the largest $\sigma_Pf$ and the $L_2(P)$ norm of the envelope of the function
class, which are especially suited for estimating localized suprema.
With this technique, we extend to function classes most of the known
results on ratio type suprema of empirical processes, including some
of Alexander's results for VC classes of sets. We also consider applications
of these results to several important problems in nonparametric
statistics and in learning theory (including general excess risk bounds
in empirical risk minimization and their versions for $L_2$-regression
and classification and ratio type bounds for margin distributions in
classification).



1. Introduction. Let $\mathcal F$ be a class of measurable functions defined on a
measurable space $(S,\mathcal S)$ and taking values in $[0,1]$, let $X, X_1,\dots,X_n,\dots$ be
a sequence of independent identically distributed $S$-valued random variables
with probability law $P$ and let

$$P_n := n^{-1}\sum_{i=1}^n\delta_{X_i}$$

be the empirical measure based on the sequence $X_i$, which, as usual, we
consider as a process on $\mathcal F$. Here is some notation that will be used throughout:

$$\sigma_Pf\ge\mathrm{Var}_P^{1/2}f,\qquad \sigma_P := \sup_{f\in\mathcal F}\sigma_Pf$$

(usually $\sigma_Pf$ will either be the standard deviation of $f$ or $\sqrt{Pf}$). Given a
continuous, strictly increasing function $\phi$ with $\phi(0)=0$, we are interested in
the behavior of suprema of the following type:

$$\sup_{f\in\mathcal F,\ r_n<\sigma_Pf\le\delta_n}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)}$$

for some sequences $r_n,\delta_n$. In particular, for given $r_n$ and $\delta_n$, we would like
to determine a normalizing sequence $\beta_n$ such that

$$\frac1{\beta_n}\sup_{f\in\mathcal F,\ r_n<\sigma_Pf\le\delta_n}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)}$$

remains bounded or converges to a constant in probability or almost surely.
We are also interested in conditions under which the sequence of stochastic
processes

$$\frac{n^{1/2}(P_nf-Pf)}{\phi(\sigma_Pf)}$$

converges in distribution to a Gaussian process indexed by $f\in\mathcal F$. Such
stochastic processes are often called normalized or ratio type empirical pro- 
cesses and the distributional convergence results are weighted central limit 
theorems for empirical processes. The study of these processes has a long 
history that goes back to the 1970s and 1980s when the classical case of 
$\mathcal F := \{I_{(-\infty,t]} : t\in\mathbb R\}$ was explored in great detail and definitive answers to
most of the questions about the classical ratio type empirical processes were
given; see, for example, [48]. In the late 1980s, Alexander, in a series of papers
[1, 2, 3], extended this theory to ratio type empirical processes indexed
by VC classes of sets $\mathcal C$ (i.e., for $\mathcal F := \{I_C : C\in\mathcal C\}$). He discovered that in
this case the crucial role is played by the following functional characteristic
of the class:

$$g_{\mathcal C}(\delta) := \delta^{-1}P\Big(\bigcup_{C\in\mathcal C,\ P(C)\le\delta}C\Big),$$



which he called the capacity function of C. This function is involved in rather 
sharp and subtle exponential inequalities for empirical processes indexed by 
VC classes proved by Alexander. The behavior of the capacity function as
$\delta\to0$ happened to be closely related to the asymptotic behavior of ratio
type suprema of empirical processes. In recent years, there has been a great
deal of work on the development of ratio type inequalities, primarily, in more 
specialized contexts of nonparametric statistics (see [31, 46]) and learning 
theory (see [5, 6, 7, 9, 11, 26, 27, 28, 33, 37, 38, 39], etc.). These inequalities 
have become one of the important ingredients in determining asymptotically 
sharp convergence rates in regression, classification and other nonparametric 
problems and they proved to be crucial in bounding the generalization error 
of learning algorithms based on empirical risk minimization. 

In this paper, building upon our earlier work with Jon Wellner [22], we 
are trying to develop further a general methodology for proving exponential 
bounds and exploring asymptotics of ratio type empirical processes. This 
methodology is based on the deservedly famous Talagrand's concentration 
inequality [43] and on the simple idea of splitting the class into slices 
consisting of functions for which the values of $\phi(\sigma_Pf)$ are roughly the same.
The empirical process on each slice is compared with its expectation using 
Talagrand's inequality and then all the pieces are put together using the 
union bound. This simple approach, called the method of slicing or peeling, 
proved to be rather successful in statistical applications (as in [9] or [31]) 
and it also allows us to obtain a number of sharp results on asymptotics 
of ratio type suprema (including weighted CLTs), essentially as a straight- 
forward corollary of Talagrand's inequality. The conditions of these limit 
theorems are expressed in terms of expected values of localized suprema of 
empirical processes (suprema over the slices). To translate these conditions 
into a more convenient language for special function classes $\mathcal F$ one has to
develop bounds on expected localized suprema. We prove such bounds (both 
upper and lower) under some conditions on random entropies of the class. 
Unlike most previously known bounds, the new bounds involve the L2 norm 
of the measurable envelope of the class which in applications to ratio 
limit theorems become the envelopes of the slices. These localized envelopes 
play about the same role in our theory as Alexander's capacity function 
plays for classes of sets (and, moreover, in the case when $\mathcal F$ is a class of indicators
of sets the conditions on localized envelopes can be reformulated as
conditions on the capacity function). We are trying to explore in this paper
both the power and the limitations of the approach based on slicing and on
Talagrand's inequality, and to this end we provide some examples showing in
which cases the conditions we obtain are sharp. Our main goal is to provide 
a link (that seemed to be missing) between the results for classical empirical 
processes of the 1970s and 1980s extended by Alexander to VC classes of 
sets and more recent results on ratios developed primarily in learning the- 
ory and based on Talagrand's concentration inequality. At the moment, our 
method allows us to generalize a number of Alexander's theorems to classes 
of functions, but some other theorems and exponential inequalities seem to 
be beyond the reach of our approach. On the other hand, most of his specific
corollaries for classical empirical processes in $\mathbb R^d$ can be obtained by a
slight modification of our method, consisting in further decomposing each
slice corresponding to a small variance into a relatively small number of subclasses
with envelopes which are considerably smaller than that of the full
slice. A bit surprisingly, the classical case of classes of sets of small entropy
(which are needed to study the standard empirical processes) is harder to
handle using Talagrand's inequality and general expectation bounds than
the much more massive function classes commonly used in learning theory
and nonparametric statistics. In part, this is related to the fact that the
Poisson tail parts of the exponential inequalities play a more important role 
in this case, leading to more complicated asymptotic properties. 
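
The slicing (peeling) step described above can be sketched numerically. The following is only an illustration under stated assumptions, not code from the paper: the names `peeling_grid` and `slice_index` are hypothetical, the grid is the geometric one, $\rho_j = rq^j$, introduced in Section 2, and the inputs $r = 0.01$, $\delta = 0.25$, $q = 2$ are made up.

```python
import math

def peeling_grid(r, delta, q=2.0):
    """Endpoints rho_j = r * q**j, j = 0..l, of the slices (rho_{j-1}, rho_j].

    l is the smallest integer >= log_q(delta / r), so the slices cover (r, delta].
    """
    if not (0.0 < r < delta <= 1.0):
        raise ValueError("need 0 < r < delta <= 1")
    if not (1.0 < q <= 2.0):
        raise ValueError("need 1 < q <= 2")
    l = math.ceil(math.log(delta / r) / math.log(q))
    return [r * q ** j for j in range(l + 1)]

def slice_index(sigma, grid):
    """Index j >= 1 with grid[j-1] < sigma <= grid[j].

    Here sigma plays the role of sigma_P f for one function f of the class.
    """
    for j in range(1, len(grid)):
        if grid[j - 1] < sigma <= grid[j]:
            return j
    raise ValueError("sigma lies outside (r, rho_l]")

# Illustrative values only.
grid = peeling_grid(r=0.01, delta=0.25, q=2.0)
print(len(grid) - 1)            # number of slices l
print(slice_index(0.03, grid))  # slice that contains sigma = 0.03
```

The number of slices $l$ grows only logarithmically in $\delta/r$, which is why the union bound over slices costs little.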

Finally, we provide several applications of ratio type empirical processes. 
First of all, we derive in a much shorter way recent results of the second 
named author [25] on empirical margin distributions motivated by the needs 
of learning theory, specifically, the analysis of large margin classifiers. Sec- 
ond, we give general ratio type bounds on excess risk and derive from them 
upper bounds on excess risk in abstract empirical risk minimization problems 
and in a more specialized context of regression and classification. In particu- 
lar, this allows us to prove in a very easy way recent results of Tsybakov [44] 
on fast convergence rates in classification and, also for classification, to refine 
recent bounds of Massart and Nedelec [34], using a version of Alexander's 
capacity function. 

The article is organized as follows. Section 2 contains the general expo- 
nential bounds for ratio empirical processes. Section 3 is devoted to moment 
bounds for empirical processes whose metric entropy with respect to the empirical
$L_2$ distance is bounded by a regularly varying function independently
of $P_n$; this includes, among others, VC-subgraph and VC-major classes. The
reader interested in applications of the foregoing to ratios of margin distri- 
butions and to empirical risk minimization, may go directly from Section 3 
to Sections 6 and 7, where we deal with these subjects. Sections 4 and 5 are 
devoted to rates (a.s. and in pr.), local and global moduli and limit theorems 
(including the central limit theorem) for ratio empirical processes. 
2. Concentration inequalities for normalized empirical processes. In this
section we derive the basic inequalities for ratio empirical processes. They are
based on Talagrand's fundamental 1996 inequality, which will be formulated
below. In what follows, $(S,\mathcal S)$ is a measurable space, $P$ is a probability
measure on it, $X_i$ are the coordinates $S^{\mathbb N}\mapsto S$, $\varepsilon_i$ are independent Rademacher
variables independent of the variables $X_i$ (defined on, e.g., $([0,1],\mathcal B,\lambda)$, and
taking as $\Omega$ the product probability space
$([0,1]\times S^{\mathbb N},\ \mathcal B\otimes\mathcal S^{\mathbb N},\ \lambda\times P^{\mathbb N} =: \Pr)$), $\mathcal F$ is
a countable or suitably measurable (see, e.g., Dudley [17], Chapter 5) class
of measurable functions on $S$ and $F$ is a measurable envelope of $\mathcal F$, that is,
for all $f\in\mathcal F$, $x\in S$, $|f(x)|\le F(x)$. For each $n$, $P_n$ is the empirical measure
$n^{-1}\sum_{i=1}^n\delta_{X_i}$. As usual, we will also write $\|\Psi(f)\|_{\mathcal F}$ for $\sup_{f\in\mathcal F}|\Psi(f)|$.



Talagrand's inequality. For any measurable, uniformly bounded class
of functions $\mathcal F$,

$$\Pr\Big\{\Big|\Big\|\sum_{i=1}^nf(X_i)\Big\|_{\mathcal F}-E\Big\|\sum_{i=1}^nf(X_i)\Big\|_{\mathcal F}\Big|\ge t\Big\}\le K\exp\Big\{-\frac1K\frac tU\log\Big(1+\frac{tU}V\Big)\Big\},\tag{2.1}$$

valid for all $t>0$, and where $K$ is a universal constant, $U$ is a uniform bound
for the functions in $\mathcal F$ and $V$ is any number satisfying
$V\ge E\sup_{f\in\mathcal F}\sum_{i=1}^nf^2(X_i)$.

The inequality holds also for $\{X_i\}$ that are not necessarily identically
distributed. The quantity $V$ is of course bounded by $n\|F\|_2^2$ if $F$ is a measurable
envelope for the class $\mathcal F$, a trivial bound that, however, can sometimes be
used. A more interesting bound that follows from randomization together
with a contraction principle for Rademacher processes is the following, given
by Talagrand ([42], Corollary 3.4):

$$E\Big\|\sum_{i=1}^nf^2(X_i)\Big\|_{\mathcal F}\le n\sigma^2+8UE\Big\|\sum_{i=1}^n\varepsilon_if(X_i)\Big\|_{\mathcal F},\tag{2.2}$$

where $\sigma^2=\sup_{f\in\mathcal F}Ef^2(X_1)$ (see also [29]). Typically, Talagrand's inequality
is used in combination with this bound for $V$.
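
To get a feeling for the two regimes of the exponent in (2.1), one can evaluate its right-hand side numerically. The sketch below is only illustrative: the function names are hypothetical, and since the universal constant $K$ is not specified in the text it is passed as a parameter with an arbitrary default of 1.

```python
import math

def talagrand_exponent(t, U, V, K=1.0):
    """Exponent (1/K) * (t/U) * log(1 + t*U/V) appearing in (2.1)."""
    if min(t, U, V, K) <= 0:
        raise ValueError("t, U, V and K must be positive")
    return (t / (K * U)) * math.log1p(t * U / V)

def talagrand_tail(t, U, V, K=1.0):
    """Right-hand side of (2.1): K * exp(-exponent)."""
    return K * math.exp(-talagrand_exponent(t, U, V, K))

# Small t*U/V: the exponent is close to t**2 / (K*V) (Gaussian-type regime).
print(talagrand_exponent(1.0, 1.0, 1000.0), 1.0 / 1000.0)
# Large t*U/V: the exponent grows like (t/(K*U)) * log(t*U/V) (Poisson-type regime).
print(talagrand_exponent(1000.0, 1.0, 1.0))
```

The two regimes are the source of the two maxima in the quantity $\tau_{n,q,\phi}$ defined later in this section.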

In the sequel, throughout, we may drop the subindex $P$ in such notation
as $\sigma_Pf$ if no confusion arises, particularly in proofs. Given $0<r<1$ and
$r<\delta\le1$, we set

$$\mathcal F(r) := \{f\in\mathcal F : \sigma_P(f)\le r\}\quad\text{and}\quad\mathcal F(r,s] := \mathcal F(s)\setminus\mathcal F(r);$$

for $1<q\le2$ and $r<s\le rq^l$ for some $l\in\mathbb N$, we let

$$\rho_j := rq^j,\qquad j=0,\dots,l,$$

and

$$\psi_{n,q}(u) := E\|P_n-P\|_{\mathcal F(\rho_{j-1},\rho_j]},\qquad u\in(\rho_{j-1},\rho_j],\quad j=1,\dots,l.$$

Of course, given $\delta$ and $r$ we take $l$ to be the smallest integer larger than or
equal to $\log_q(\delta/r)$. Given a continuous, strictly increasing function $\phi$ such
that $\phi(0)=0$, we also define

$$\phi_q(u) := \phi(\rho_j),\qquad u\in(\rho_{j-1},\rho_j],\quad j=1,\dots,l,$$

and

$$\beta_{n,q,\phi}(r,s] := \sup_{u\in(r,s]}\frac{\psi_{n,q}(u)}{\phi_q(u)},$$

and sometimes we may use instead the nondiscretized version, namely

$$\beta_{n,q,\phi}(r,s] := \sup_{u\in(r,s]}\frac{E\|P_n-P\|_{\mathcal F(uq^{-1},u]}}{\phi(u)}.$$

Some of the subindices of $\beta$ may be dropped in proofs. We also set

$$V_{n,q}(\rho_j)=V_n(\rho_j) := \frac1nE\Big\|\sum_{i=1}^n(f(X_i)-Pf)^2\Big\|_{\mathcal F(\rho_{j-1},\rho_j]}$$

and note that, by (2.2) and the comment before it, if $F_j$ is a measurable
envelope for $\mathcal F(\rho_{j-1},\rho_j]$, then

$$q^{-2}\rho_j^2<V_n(\rho_j)\le(PF_j^2)\wedge[\rho_j^2+16\psi_n(\rho_j)],\tag{2.3}$$

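where for the lower bound we assume that $\sigma_Pf=\mathrm{Var}_P^{1/2}f$. Finally, we let $\gamma$
be the inverse function of $\gamma^{-1}(x) := x\log(1+x)$, $0\le x<\infty$. Note that

$$\gamma(x)\le\begin{cases}\dfrac{2x}{\log(1+x)}, & \text{for } x>0,\\[1ex]\dfrac{2x}{\log x}, & \text{for } x>2,\\[1ex]2\sqrt x, & \text{for } 0<x\le2.\end{cases}$$

Denote

$$E_{n,q,\phi} := E\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}$$

and

$$\tau_{n,q,\phi} := \tau_{n,q,\phi}(s_1,\dots,s_l) := \max_{j:\,s_j>2n\bar V_{n,q}(\rho_j)}\frac{2s_j}{n\phi(\rho_j)\log(s_j/(n\bar V_{n,q}(\rho_j)))}\ \vee\ \max_{j:\,s_j\le2n\bar V_{n,q}(\rho_j)}2\sqrt{\frac{s_j\bar V_{n,q}(\rho_j)}{n\phi^2(\rho_j)}}$$

for any $\bar V_{n,q}(\rho_j)\ge V_{n,q}(\rho_j)$.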
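
The function $\gamma$ and the quantity $\tau_{n,q,\phi}$ are easy to evaluate numerically, which may help when checking the two regimes above. The following is a minimal sketch under stated assumptions: the names `gamma` and `tau` are hypothetical, $\gamma$ is computed by plain bisection (any root finder would do), and the numerical inputs are made up for illustration.

```python
import math

def gamma(y, iters=200):
    """Inverse of x -> x*log(1+x) on [0, inf), computed by bisection."""
    if y < 0:
        raise ValueError("y must be nonnegative")
    lo, hi = 0.0, max(y, 2.0)  # hi*log(1+hi) >= y for this choice of hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.log1p(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tau(n, rho, s, v_bar, phi):
    """Maximum over slices of the Poisson-type or Gaussian-type term.

    rho, s, v_bar are sequences indexed by the slice j; phi is the
    normalizing function. v_bar[j] stands for an upper bound on V_n(rho_j).
    """
    best = 0.0
    for rho_j, s_j, v_j in zip(rho, s, v_bar):
        if s_j > 2 * n * v_j:   # Poisson-type regime
            term = 2 * s_j / (n * phi(rho_j) * math.log(s_j / (n * v_j)))
        else:                   # Gaussian-type regime
            term = 2 * math.sqrt(s_j * v_j / (n * phi(rho_j) ** 2))
        best = max(best, term)
    return best

# Illustrative values: one slice, phi(t) = t.
identity = lambda t: t
print(tau(100, [0.1], [1.0], [0.01], identity))   # Gaussian-type term
print(tau(100, [0.1], [10.0], [0.01], identity))  # Poisson-type term
```

In both regimes the term computed by `tau` dominates $\bar V\gamma(s_j/(n\bar V))/\phi(\rho_j)$, which is the deviation that Talagrand's inequality produces on a single slice.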

The following immediate application of Talagrand's inequality holds the
key to the ratio limit theorems to be obtained below. It shows that the
supremum

$$\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}$$

concentrates with high probability around both $\beta_{n,q,\phi}$ and $E_{n,q,\phi}$ with the
same magnitude of the deviations (of the order $\tau_{n,q,\phi}$). In particular, it also
means that $\beta_{n,q,\phi}$ and $E_{n,q,\phi}$ are within $\sim\tau_{n,q,\phi}$ of each other.



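Theorem 2.1. With the above definitions, there exist universal constants
$K,C\in(0,\infty)$ such that for any sequence $s_j$ of positive numbers

$$\Pr\Big\{\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-\beta_{n,q,\phi}\Big|\ge\tau_{n,q,\phi}(s_1,\dots,s_l)\Big\}\le K\sum_{j=1}^le^{-s_j/K}\tag{2.4a}$$

and

$$\Pr\Big\{\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-E_{n,q,\phi}\Big|\ge C\tau_{n,q,\phi}(s_1,\dots,s_l)\Big\}\le K\sum_{j=1}^le^{-s_j/K}.\tag{2.4b}$$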



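Proof. Set $\mathcal F_j := \mathcal F(\rho_{j-1},\rho_j]$. Then, we can rewrite Talagrand's
inequality as

$$\Pr\Big\{\big|\|P_n-P\|_{\mathcal F_j}-E\|P_n-P\|_{\mathcal F_j}\big|\ge V_n(\rho_j)\gamma\Big(\frac{s_j}{nV_n(\rho_j)}\Big)\Big\}\le Ke^{-s_j/K},\tag{2.5}$$

$j=1,\dots,l$. Hence, with probability at least $1-K\sum_{j=1}^le^{-s_j/K}$,

$$\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-\beta_{n,q,\phi}\Big|\le\max_{1\le j\le l}\Big|\frac{\|P_n-P\|_{\mathcal F_j}}{\phi(\rho_j)}-\frac{\psi_{n,q}(\rho_j)}{\phi(\rho_j)}\Big|\le\max_{1\le j\le l}\frac{V_n(\rho_j)\gamma(s_j/(nV_n(\rho_j)))}{\phi(\rho_j)}.$$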
Now, (2.4a) follows from the bounds for the function $\gamma$, namely, if
$s_j>2nV_n(\rho_j)$ we use $\gamma(x)\le2x/\log x$ to get

$$\frac{V_n(\rho_j)\gamma(s_j/(nV_n(\rho_j)))}{\phi(\rho_j)}\le\frac{2s_j}{n\phi(\rho_j)\log(s_j/(nV_n(\rho_j)))}$$

and otherwise we use $\gamma(x)\le2\sqrt x$ to get

$$\frac{V_n(\rho_j)\gamma(s_j/(nV_n(\rho_j)))}{\phi(\rho_j)}\le2\sqrt{\frac{s_jV_n(\rho_j)}{n\phi^2(\rho_j)}}.$$

Next, we will show that

$$\Pr\Big\{\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-\beta_{n,q,\phi}\Big|\ge2s\tau_{n,q,\phi}\Big\}\le C(\{s_j\})\exp\{-s/K\},\tag{2.6}$$

where

$$C(\{s_j\}) := K\sum_{j=1}^le^{-s_j/K},$$

which is supposed to be smaller than 1 (otherwise, the inequalities of the
theorem are trivial). Integration of (2.6) immediately implies that, for some
$C>0$,

$$E\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-\beta_{n,q,\phi}\Big|\le C\tau_{n,q,\phi},$$

and, as a consequence,

$$\beta_{n,q,\phi}\le E_{n,q,\phi}\le\beta_{n,q,\phi}+C\tau_{n,q,\phi}.$$

The last bound shows that in (2.4a) $\beta_{n,q,\phi}$ can be replaced by $E_{n,q,\phi}$ if we
multiply $\tau_{n,q,\phi}$ by a constant, which proves (2.4b).

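To prove (2.6), we again use (2.5) with $s_j$ replaced by $s_j+s$. It is enough
to assume that $s,s_j\ge2$. The right-hand side of (2.5) becomes
$K\exp\{-(s+s_j)/K\}$. If $s_j>2nV_n(\rho_j)$ (and $s_j+s$ is even larger), we argue as in the proof
of (2.4a) to get

$$\frac{V_n(\rho_j)\gamma((s_j+s)/(nV_n(\rho_j)))}{\phi(\rho_j)}\le\frac{2(s_j+s)}{n\phi(\rho_j)\log((s_j+s)/(nV_n(\rho_j)))}\le s\,\frac{2s_j}{n\phi(\rho_j)\log(s_j/(nV_n(\rho_j)))}\le s\tau_{n,q,\phi}$$

(using $s,s_j\ge2$). On the other hand, if $s_j\le2nV_n(\rho_j)$, we use subadditivity
of $\gamma$,

$$\gamma(x+y)\le\gamma(x)+\gamma(y),$$

which follows from the inequality

$$(x+y)\log(1+x+y)\ge x\log(1+x)+y\log(1+y),$$

that is easy to prove directly. This gives

$$\frac{V_n(\rho_j)\gamma((s_j+s)/(nV_n(\rho_j)))}{\phi(\rho_j)}\le\frac{V_n(\rho_j)\gamma(s_j/(nV_n(\rho_j)))}{\phi(\rho_j)}+\frac{V_n(\rho_j)\gamma(s/(nV_n(\rho_j)))}{\phi(\rho_j)}.$$

The first term is bounded (as in the first part of the proof of Theorem 2.1)
by

$$2\sqrt{\frac{s_jV_n(\rho_j)}{n\phi^2(\rho_j)}}.$$

The second term is dominated either by

$$2\sqrt{\frac{sV_n(\rho_j)}{n\phi^2(\rho_j)}}$$

in the case when $s\le2nV_n(\rho_j)$, or otherwise [if $s>2nV_n(\rho_j)$] by

$$\frac{2s}{n\phi(\rho_j)\log(s/(nV_n(\rho_j)))}\le\frac2{\log2}\,s\,\frac1{n\phi(\rho_j)},$$

which can be further bounded by

$$\frac2{\log2}\,s\sqrt{\frac{s_jV_n(\rho_j)}{n\phi^2(\rho_j)}}\le\frac32\,s\tau_{n,q,\phi}$$

[since we have

$$nV_n(\rho_j)\ge s_j/2\ge1\ge1/s_j].$$

The result now follows easily. □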

We may want to normalize the empirical process $P_nf-Pf$ by $\phi(\sigma_Pf)$
instead of $\phi_q(\sigma_Pf)$; in this case we do not obtain a concentration inequality,
but two very similar deviation inequalities (one from above and one from
below), particularly if $\phi$ is regular enough. The above theorem gives the
following:

Corollary 2.2. Assume that the continuous nondecreasing function
$\phi$ satisfies that the quantity $C_{q,r,\phi} := \sup_{r\le x\le1}\phi(qx)/\phi(x)$ is finite for some
$1<q\le2$. Then, with $s_j$ and $K$ as in Theorem 2.1 and under the same
assumptions, we have both

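$$\Pr\Big\{\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)}\ge C_{q,r,\phi}\big(\beta_{n,q,\phi}+\tau_{n,q,\phi}\big)\Big\}\le K\sum_{j=1}^le^{-s_j/K}$$

and

$$\Pr\Big\{\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)}\le\beta_{n,q,\phi}-\tau_{n,q,\phi}\Big\}\le K\sum_{j=1}^le^{-s_j/K}.$$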



A way to use these propositions is as follows: if we let 

where $q$, $r$, $\delta$ and $\{s_j\}$ may depend on $n$, and take $s_j=s_{j,n}$ such that
$\sum_{j=1}^le^{-s_{j,n}/K}$ tends to zero, then the sequence

1 \Pnf-Pf\ 

On /e^ 9\<ypj) 

r < (7 p / < (5 

is stochastically bounded. The following lemma of Alexander [2] allows one 
to get a.s. results. 

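Lemma 2.3. With the same notation as above, let $c_n/n\downarrow$, $r_n\downarrow$, $\sqrt n\,\delta_n\uparrow$
and $u_n\downarrow$. Set

$$A_n=\{\sqrt n|P_nf-Pf|>c_n\phi(\sigma_Pf)+u_n\ \text{for some } f\in\mathcal F \text{ with } r_n<\sigma_Pf\le\delta_n\}$$

and

$$A_n^\varepsilon=\{\sqrt n|P_nf-Pf|>(1-\varepsilon)(c_n\phi(\sigma_Pf)+u_n)\ \text{for some } f\in\mathcal F \text{ with } r_n<\sigma_Pf\le\sqrt{(1+\varepsilon)}\,\delta_n\},$$

and assume

$$\inf\{c_n\phi(t)/t : n\ge1,\ t\in[r_n,\delta_n]\}>0.$$

Then, if for some $\varepsilon,\theta>0$,

$$\Pr(A_n^\varepsilon)=O(1/(\log n)^{1+\theta}),$$

we have

$$\Pr(A_n\ \text{i.o.})=0.$$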



Sometimes, $\psi_{n,q}$ [and therefore also $\beta_{n,q,\phi}$ and $V_{n,q}(\rho_j)$] is still too large
because the envelope of $\mathcal F_j$ is too large. Then, one may further subdivide
$\mathcal F_j$ into $N_j$ classes $\mathcal F_{j,k}$ with smaller envelopes and such that $N_j$ is not too
large (perhaps of the order of $\log\rho_j^{-1}$). For instance, this happens with the
$d$-dimensional distribution function as we see below. One may take $\mathcal F_{j,k}$ to be
the intersection with $\mathcal F_j$ of each of the components of an optimal covering
of $\mathcal F_j$ by $L_2(P)$ balls of radius $\tau\rho_j$, $0<\tau<1$, but other subdivisions are
possible; in particular, $N_j$ could be 1 for some or all $j$. We can apply the
same principle as in the proof of Theorem 2.1 and get a bound that takes
this into account, as follows. Let $\mathcal F_j=\bigcup_{k=1}^{N_j}\mathcal F_{j,k}$, let

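$$\psi_{n,q,j,k} := E\|P_n-P\|_{\mathcal F_{j,k}},\qquad\tilde\beta_{n,q,\phi} := \max_{1\le j\le l}\max_{1\le k\le N_j}\frac{\psi_{n,q,j,k}}{\phi(\rho_j)},$$

$$V_{n,q,j,k} := \frac1nE\Big\|\sum_{i=1}^n(f(X_i)-Pf)^2\Big\|_{\mathcal F_{j,k}},$$

$$\tilde\tau_{n,q,\phi} := \max_{j,k:\,s_{j,k}>2n\bar V_{n,q,j,k}}\frac{2s_{j,k}}{n\phi(\rho_j)\log(s_{j,k}/(n\bar V_{n,q,j,k}))}\ \vee\ \max_{j,k:\,s_{j,k}\le2n\bar V_{n,q,j,k}}2\sqrt{\frac{s_{j,k}\bar V_{n,q,j,k}}{n\phi^2(\rho_j)}}$$

and $\bar V_{n,q,j,k}\ge V_{n,q,j,k}$. Then, we have the following.

Theorem 2.1'. With the above definitions and letting $s_{j,k}$ be a double
sequence of positive numbers, there exists a universal constant $K\in(0,\infty)$
such that

$$\Pr\Big\{\Big|\sup_{f\in\mathcal F,\ r<\sigma_Pf\le\delta}\frac{|P_nf-Pf|}{\phi_q(\sigma_Pf)}-\tilde\beta_{n,q,\phi}\Big|\ge\tilde\tau_{n,q,\phi}\Big\}\le K\sum_{j=1}^l\sum_{k=1}^{N_j}e^{-s_{j,k}/K}.\tag{2.4'}$$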



The analogues of inequality (2.4b) and of the one-sided inequalities of 
Corollary 2.2 hold as well. 

Remark 2.4 (On the choice of $s_j$). In general, we must take

$$s_j=K\log\frac1{c_j}$$

with $\sum_{j=1}^lc_j$ small, as in this case, $\sum_{j=1}^le^{-s_j/K}=\sum_{j=1}^lc_j$. If we take $s_j=s$
for a number of $j$'s more or less comparable to $l$, then a good choice is to
take

$$s=K'\log l$$

for some $K'>K$, so that

$$\sum_{j=1}^le^{-s/K}\le\frac1{l^{K'/K-1}},$$

which will tend to zero if $l\to\infty$ [so, if $\log_q(\delta_n/r_n)\to\infty$]. Another possible
choice is



$$s_j=sq^{\alpha j}$$

for some $\alpha>0$, which gives



j=l 5 - j=l 



oo 



(2.7) < r e-'^^'^dx 



and 



(2.7') 



Q° - 1 S 



j^Q ~^ q^-ls) 



bounds that can be made small by increasing $s$. Finally, another choice is

$$s_j=s_n+K\log\log_q(q\delta/\rho_j),$$

as is easy to check. However, there does not seem to be a choice of $s_j$ that
works in all situations (see, e.g., the last part of Example 2.7 below for an
unexpected choice for $s_j$).
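
The first two choices in Remark 2.4 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the values of $K$, $K'$, $l$ and $c_j$ are made up (the universal constant $K$ is not specified in the text).

```python
import math

def union_bound_sum(s, K):
    """sum_j exp(-s_j / K), the factor multiplying K in the bounds of Theorem 2.1."""
    return sum(math.exp(-s_j / K) for s_j in s)

def levels_from_weights(c, K):
    """s_j = K * log(1 / c_j), so that exp(-s_j / K) = c_j."""
    return [K * math.log(1.0 / c_j) for c_j in c]

def constant_levels(l, K_prime):
    """s_j = K' * log(l) for every j = 1..l."""
    return [K_prime * math.log(l)] * l

K = 1.0  # placeholder for the unspecified universal constant
print(union_bound_sum(levels_from_weights([0.01, 0.02, 0.03], K), K))  # sum of the c_j
print(union_bound_sum(constant_levels(16, 2.0), K))                    # l**(1 - K'/K)
```

With constant levels the sum equals $l^{1-K'/K}$, so it is small only when the number of slices $l$ is large, as noted in the remark.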

Remark 2.5 (The role of the stratification $\mathcal F=\bigcup_{j=1}^l\mathcal F(\rho_{j-1},\rho_j]$). Since
on each stratum $\mathcal F_j := \mathcal F(\rho_{j-1},\rho_j]$ the function $f\mapsto\phi(\sigma_Pf)$ is essentially
constant [assuming that $\phi(u)\asymp\phi(qu)$], we have

$$\sup_{f\in\mathcal F_j}\frac{|P_nf-Pf|}{\phi(\sigma_Pf)}\asymp\frac{\|P_n-P\|_{\mathcal F_j}}{\phi(\rho_j)},$$

which is why the terms in the bounds (2.4) do depend on the complexity of
these strata [measured by $\psi_{n,q}(\rho_j)$ and $V_n(\rho_j)$, which ultimately also depends
on $\psi_{n,q}(\rho_j)$], usually simpler than the complexity of $\mathcal F$. Instead of stratifying,
we could simply apply Talagrand's inequality to the class of functions
$\{\phi(r)f/\phi(\sigma_Pf) : f\in\mathcal F,\ \sigma_Pf>r\}$. But these classes are more complicated
than $\mathcal F_j$ and so would be the parameters of the inequality. These parameters
often depend on the $L_2$ norm of the envelope of the corresponding
classes, and there may be a good advantage in using the local envelopes
$\sup\{|f(x)| : f\in\mathcal F_j\}/\phi(\rho_j)$ rather than the global
$\sup\{|f(x)/\phi(\sigma_Pf)| : f\in\mathcal F\}$. This advantage comes at a cost, at least with regard to distributional
or in probability results: whereas the sequence $s_j$ should be large enough so
that the series $\sum_je^{-s_j/K}$ converges and has a small sum, we do not have to
deal with this series if we apply Talagrand's inequality to the whole class.
In this last case $s$ can be any number such that $e^{-s/K}$ is of the desired size.
However, if one wants to apply Alexander's lemma, then $s$ must be of the
order of $\log\log n$, which in general is comparable to $\log l$, hence to $s_j$ if $s_j$
does not depend on $j$. This cost is usually overwhelmed by the mentioned
advantage, and in the worst case, the number $\tau_{n,q,\phi}$ in (2.4) is at most a
factor of $\log l$, or even $\sqrt{\log l}$, larger than it would be by direct application
of Talagrand, and not larger at all (except perhaps for a constant factor) if
we want the probability bound to be of the order of $1/(\log l)^{1+\theta}$.



Remark 2.6. Another approach, used, for instance, by Massart [31] or
Bousquet [9], is based on stratification, but uses Talagrand's inequality only
once, which is relevant to Remark 2.5, but which results in other losses when
the class of functions is relatively small. We briefly describe this approach.
Suppose that for all $\rho>0$

$$E\|P_n-P\|_{\mathcal F(\rho)}\le\psi_n(\rho),$$

where $\psi_n$ is a function satisfying that for some $\lambda\in(0,1)$ the function $\rho\mapsto$

1 1 



is nonincreasing. Assume also that with the same A 



3-Pj'- 

for some constant Cq^\^(p. Note that these conditions immediately imply that 



V^n(r) ^ 1 



< 



E 



Consider now the class 
which is also bounded by 1. We have



E\\Pn-P\\g< E 5?T^ll^n-P|h 



< E 



i'niPj) 



Using (2.2), this gives 



1 



Vn{g):=-E 



n 



J2{f{x,)-pfy 



i=l 



(l>{rn) 



^ s^P Tr^Pi + '^^^\\Pn-P\\g. 

j:Pj>r4>{Pjr 

We will assume that either pt-^ is nonincreasing (case 1), or it is non- 
decreasing (case 2). In case 1, 

Applying Talagrand's inequality to the class Q the same way we did in the 
proof of Theorem 2.1, we get that with probability at least 1 — Ke~^^^ 

\Pnf - Pf\ 



fer (Pq[crpJ) 

r<apf<S 



n,q,, 



1 



- P\\g - E\\Pn - P\ 



<I{s>2nVniG))- 



2s 



n(j}{r)log{s/{nVnig))) 
+ I{s < 2nVn{G))2i 



(2.8) 



>2 1 + 16C.A 



n4>{rY 

'4'n{r) 



2s 



n(p{r) log(s/ (nr2(l + 16cq,A,</.V'n(r)/r2))) 



+ /( ^<2( 1 + 16C, 



^n(r) 



X 2\ 



s 



1 + 16c, 
Similarly, in case 2,

Again, by Talagrand's inequality, with probability at least 1 — Ke~^^^ 

Pnf - Pf\ 



r <(Tp / <(5 



< I{s > 2nVn{Q)) 
(2.9)=/ 



E, 



n,q,< 



2s 



ncl){ry 



n(j){r)\og{s/{nVniG))) 



+ I{s<2nVn{Q))2i 



n0[r] 



>2 cs + lQc, 



'q,X,(f>- 



2s 



+ 1 



n(j){r)\og{s / {n(t){rY{c^ + lQcqX(l>'^n{r) / (l){rY))) 

Mr) 



n0[r] 



< 2 C0 + 16Cq,A,, 



4>[rf 



C0 + 16Cg,A, 



Bartlett and Mendelson [7] used another approach that also allowed them to 
apply Talagrand's inequality only once under an extra geometric assumption 
on the class (namely, that the class is star-shaped). 

Example 2.7. As an illustration of Theorem 2.1, we will recover known
results on the a.s. and the in probability behavior of

$$T_n := \sup_{1/n\le t\le1/2}\frac{|F_n(t)-t|}{\sqrt t},$$

where $F_n$ is the empirical distribution function corresponding to $n$ independent
samples from the uniform distribution on $[0,1]$. In this case,
$\mathcal F=\{I_{[0,t]} : 0\le t\le1/2\}$, $\sigma_PI_C=\sqrt{PC}$, $\phi(t)=t$, $r=r_n=1/\sqrt n$, $\delta=1/4$, $q$ is any
number between 1 and 2, say 2, $l_n=(\log n/4)/(2\log2)$, $\mathcal F_j=\{I_{[0,t]} : t\le\rho_j^2\}$,
we can take $\bar V_n(\rho_j)=2\rho_j^2$ and

$$\psi_n(\rho_j)\le\frac2nE\sup_{t\le\rho_j^2}\Big|\sum_{i=1}^n\varepsilon_iI_{[0,t]}(X_i)\Big|\le\frac4nE\Big|\sum_{i=1}^n\varepsilon_iI_{[0,\rho_j^2]}(X_i)\Big|\le\frac{4\rho_j}{\sqrt n},$$

where the first inequality follows by symmetrization and the second by
Lévy's inequality. So, the quantity $\tau_{n,q,\phi}$ of Theorem 2.1 is

$$\max_{j:\,s_j>4n\rho_j^2}\frac{2s_j}{n\rho_j\log(s_j/(2n\rho_j^2))}\ \vee\ 2\max_{j:\,s_j\le4n\rho_j^2}\sqrt{\frac{2s_j}n}.\tag{2.10}$$
Then, if we take sj = K'qlogln > -fC"loglogn with K" > AK, this bound is
dominated by 

2K" log log n 
-y/nlog log logn 

and 

Ee-^/^^<(logn)-(^"/2^-^), 
i=i 

which, by Lemma 2.3, give 

ynlogloglogn 

limsup — J„ = G<oo a.s. 

n^oo log log n 

(as f3n < 4:/^/n multiplied by the factor in front of T„ tends to zero). This is 
sharp: Csaki [13] computed this constant, which is not zero. Suppose now we 
want to find the order of magnitude of T„ in probability. Then we still take 
Sj = K" log log n if K" log log n < Anpj and notice that the number of the re- 
maining j's, such that Anpj < ^C" log logn is of the order of iC'" log log logn; 
this allows us to take a smaller Sj for such j's, for instance, we can take 
Sj = 2K log log log n and still have 

,-s,/2K < log log logn (^K"/2K-l) _^ p 

p{ log logn 

The bound (2.10) becomes of the order 



log log logn ^ /log log logn ^ /log logn /log logn 



■y/n log log log logn \ n \ n \ n 

This gives that the sequence 



n 

-Tn, n G N, 



log log n 

is stochastically bounded, a result that is also best possible since it follows
from [14] that it converges in probability to a positive constant. We should
remark that these results on the almost sure and the in probability size of $T_n$
can also be obtained by direct application of Talagrand's inequality to the
class of functions $\{I_{[0,t]}/\sqrt t : t\le1/4\}$, although the estimation of expected
values in this case is more complicated. However, they do not follow from the
method developed in Remark 2.6. At any rate, this example illustrates the
power of Theorems 2.1 and 2.1' when good expectation bounds are available,
and also how to choose $s_j$. See Examples 4.9, 4.10 and 5.8 below for more
on uniform empirical c.d.f.'s, in one or more dimensions.
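
The weighted supremum of Example 2.7 can be computed exactly for a given sample, which makes it easy to look at the normalization $\sqrt{n/\log\log n}$ in simulations. The following sketch is an illustration under stated assumptions, not the authors' code: the function name `weighted_sup` is hypothetical, the range $[a,b]$ is a parameter, and the sample size and seed are arbitrary.

```python
import bisect
import math
import random

def weighted_sup(sample, a, b):
    """sup over a <= t <= b of |F_n(t) - t| / sqrt(t), F_n the empirical c.d.f."""
    if not (0.0 < a <= b <= 1.0):
        raise ValueError("need 0 < a <= b <= 1")
    xs = sorted(sample)
    n = len(xs)

    def ratio(count, t):
        return abs(count / n - t) / math.sqrt(t)

    # Values at the two endpoints of the range.
    best = max(ratio(bisect.bisect_right(xs, a), a),
               ratio(bisect.bisect_right(xs, b), b))
    # Between jumps F_n is constant and (c - t)/sqrt(t) is monotone in t, so the
    # supremum is approached at the jump points (value and left limit there).
    for x in set(xs):
        if a < x <= b:
            best = max(best,
                       ratio(bisect.bisect_right(xs, x), x),
                       ratio(bisect.bisect_left(xs, x), x))
    return best

random.seed(0)
n = 2000
sample = [random.random() for _ in range(n)]
t_n = weighted_sup(sample, 1.0 / n, 0.5)
print(t_n * math.sqrt(n / math.log(math.log(n))))  # normalized value for one sample
```

Repeating the last lines over many seeds and several values of $n$ gives an empirical picture of the stochastic boundedness discussed in the example.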
Typically one wishes to normalize the empirical process by $\sqrt{\mathrm{Var}_Pf}$,
which corresponds to $\phi(x)=x$, or by $Pf$, which corresponds to $\phi(x)=x^2$
and $\sigma_Pf=\sqrt{Pf}$ (recall $0\le f\le1$), or by a function of $\sigma_Pf$ of the form
$\phi(x)=xL(x)$ with $L$ slowly varying at zero. Although other situations may
be considered, we will only specialize Theorem 2.1 to a small number of cases,
including these. The main job consists in choosing $s_j$ so that $\sum_je^{-s_j/K}$ is
small. The following proposition recovers an inequality in [22].

Proposition 2.8. With the notation of Theorem 2.1 for $\sigma_Pf=\sqrt{Pf}$
and $\phi(t)=t^2$, we have that, for all $s>0$,

Pnf 



Pr< g ^ sup 
(2.11) 



r'^<Pf<6 



Pf 



1 



>Pn,g,<P + 2j — {l + 16Pn,g. 



nr^ 



V 



2s 



nr2log((s/(nr2(l + 16/?„,g,^))) V2) 



1 1 



— 1 s 



and 



Vi\ sup 

r^<Pf<S 



Pnf 



Pf 



</3„,g,^-2W^(l + 16/3„ 



nr^ 



(2.12) 



V 



2s 



nrHog{{s/{nr\l + 16/3„,g,^))) V 2) 



1 1 



g2 — 1 s 



Proof. Take $\bar V_n(\rho_j)=\rho_j^2(1+16\beta_n)$ [which is allowed by (2.3)] and
$s_j=sq^{2j}$ in Corollary 2.2, and use (2.7) with $\alpha=2$ to compute the probability
bound. □

Especially important is the case $\phi(t)=t$, that is, the normalization of
the empirical process at $f$ by the standard deviation of $f(X)$. The following
proposition is slightly sharper (up to constants) than a similar inequality in
[22], and applies in a larger range.

Proposition 2.9. Let 4>{t) =t. Set Cq = max.i<j<i^{logj)/q^ and de- 
note 

W^s/q + 2cqK^s + 2Kloglog^{q5/r) 
'~ nrlog{{{5s/q + lOcqK) / (Unr"^)) V 10) 
V2Vl7i



/ s + 2Klog log (gj/r) 



n 



(2.13) 



a-) If Pn = Pn,q,4> ^ ^/len, for any positive number s, 

Pnf - Pf\ 



Pr 



sup ——, 7^ - (5n 



>Bn\< 2Ke~ 



with obvious changes in the constants if f3n < Cr for some C < oo. 
(b) // Pn > f, then, for any s > 0, t > 0, 



Pr 



(2.14) 



\Pnf-Pf\ . 
sup ——, 7T Pr, 

feT (Pq[crpf) 

r<(Tpf<S 



<K^J—le-'/'' + 2Ke- 



> 



2t 



nrlog{{t/{17nrPn))y 2) 



V2^J^Vi34 
V nr J 



q-lt 

with obvious changes in the constants if r <CPn for some C <oo. 

Proof. Assume first $\beta_n\le r$. Since $\psi_n(\rho_j)/\rho_j\le\beta_n\le r\le\rho_j$ we can take
$\bar V_n(\rho_j)=17\rho_j^2$. Now take $s_j=s+2K\log\log_q(\rho_j/r_n)=s+2K\log j$ if this
quantity does not exceed $34n\rho_j^2$ and five times this quantity otherwise. Then,
to estimate



2s i 



Tn,n,t 



max 



V max 



SjVnjpj] 



j:Sj>2nVu{Pj)npjlog{sj/{nVnipj))) s,<2nVu{Pj) \ np 

note that if x > e^, then x^^"^ /\ogx is nondecreasing, so that 

10s + 20iflogj 
nrg-? log((5s + IQK log j) / {17 nr'^q^^)) 



< 



lOs/g + 20cgi^y'l0s + 20i^ loglogg(g5/r 
nr log((5s/g + l{)CqK) / {17 nr'^)) 



which gives 



'n,q,t 



Moreover, KY}-Li < Ke~' J2T=i Vj^ < 2Ke-\ 

Assume now $r<\beta_n$. The $j$'s for which $\rho_j\ge\beta_n$ can be treated as in the
previous case (where all the $\rho_j$ were larger than or equal to $\beta_n$). If $\rho_j<\beta_n$,
then $V_n(\rho_j)/\rho_j\le\rho_j+16\beta_n\le17\beta_n$ and we take $\bar V_n(\rho_j)=17\rho_j\beta_n$. Then, with
$s_j=tq^j$, the contribution of these $j$ to $\tau_{n,q,\phi}$ is easily seen to be dominated



by 



2t 




nrlog((t/17?ir/?„) V2) 
and, by (2.7), their contribution to the probability bound is dominated by 

q—lt 

Comparing with inequality (2.8) in Remark 2.6, we see that the result 
in the previous proposition is better if in (2.8) we take s of the order of 
loglogg{q5 / r) because (pnir) >4'n{f)-, but smaller s's are possible in (2.8), 
and these do better than s + 2K \og\ogq{q5 / r) . 

By (2.3), if \\Fj\\2 is comparable to pj, then we can take Vn{pj) = — 
Vn{pj) and obtain better inequalities than (2.13) and (2.14); this is the case 
in Example 2.7 and, if one uses Theorem 2.1', this is also the case for the 
c.d.f. in several dimensions (as Alexander [2] observed and we see below). 
This applies also to Proposition 2.8 and to the ones that follow. 

Remark 2.10. If in the previous proposition we assume 



(2.15) 



rV/?„> 



I s + 2Kloglogg{q5/r) 
34n 



then the Poisson term of t„ g j can be deleted from the bounds; under this 
condition, if /3„ < r, then we see that Sj < 2nVn{pj) for all j. Under condition 
(2.15), if r < j3n, the same is true for pj > (3n- Thus we obtain 

\Pnf - Pf\ 



Pr 



(2.16) 

if fin < I"-, and 



fer (pq{crpf) 

r <(Tp /<(5 



, — /s + 2Kloglog„(g(5/r) , 
> 2Vl7\l '^^ ' ' \ < 2Ke-', 



n 



Pr 



\Pnf-Pf\ ^ 

r<i<7pf<.S 



> 



2t 

nrnlog{{t/{17nrPn)) V 2) \ nr 




(2.17) 



s + '^Kloglogg{q5/r) 



n 



<K^^^e-t/^ + 2Ke-' 
q-lt 



ifr<(3n. 
Proposition 2.11. Let $\phi(t)=t^\alpha$ for some $\alpha\in(1,2)$. Then, for any
positive number $s$,

10s 



Pr 



(2.18) 



\Pnf-Pf\ . 

sup ——, 77 (3n 

feT (Pq{(7pf) 

r<crpf<6 



> 



nr°log((s/(17nr"(r2-" V /?„))) V 10) 



's(r2-"V/5„) 



1 1 



s/K 



q'^ — 1 s 
where $\tau=2(\alpha-1)$.



Proof. One proceeds as in the previous proof by considering the cases 
> and < Pn, which correspond to Vn{pj) = 17 pj and Vn{pj) = 
17p'j/3n, respectively; in the first case one takes sj = sq^^°'~^^^ or five times 
this, and in the second, Sj = sq°^^ . □ 

The bounds for $\phi(t)=t^\alpha$, $0<\alpha<1$, are similar to those for $\alpha=1$. We
only state them in a case analogous to Remark 2.10.

Proposition 2.12. Let $\phi(t)=t^\alpha$, $\alpha\in(0,1)$, and assume
(2.15') 



s + 2i^ log \og Jq^ 6/ r) 



n 



(a) // (3n < r^~'^, then for all s > 
Pr 



\Pnf-Pf\ . 
sup ——: JT Pn 



(2.19) 



feT (t>q{(Tpf) 

<a-pf<S 



>2Vri\ 



n 



<2Ke~ 



where Cg,^ = supo<„<5g u^^^ loglog^{q'^ 6 / u) . 
(b) ///3„ >r2-", s,t>0 and 



2t 



n 



nr"^ log((t/(17nr"/3„)) V 2\ 



■ V2V17 



nr^ 



then 
(2.20) Pr 



\Pnf-Pf\ . 
fer cpq{(7pf) 

r <(Tp/<i5 



>S„J> <K^^_le~'/^ + 2Ke-'. 
- q'^-lt 
Proof. Take $V_n(\rho_j) = 17\rho_j^\alpha\beta_n$ if $\rho_j^{2-\alpha} < \beta_n$ and $V_n(\rho_j) = 17\rho_j^2$ otherwise.
In the first case, set $s_j = sq^{\alpha j}$ and in the second $s_j = s + 2K\log\log_q(q^\alpha\delta/\rho_j)$
or e^/"/2 times this. □ 

The case $\phi(t) = t^\alpha L(1/t)$ with $L$ monotone and slowly varying at infinity
is also easy to handle, and we will when needed.

It should be noted that the bounds in the last three propositions are
sharp only to the extent that the estimate $V_n(\rho_j) \le \rho_j^2 + 16\psi_n(\rho_j)$ is sharp.
Sometimes the class $\mathcal F_j$ can be further decomposed into a relatively small
number of classes $\mathcal F_{j,k}$ for which $V_{n,j,k} \le c\rho_j^2$, and then it is Theorem 2.1'
that gives inequalities leading to sharp results.

3. Inequalities for expected values of suprema of empirical processes under
uniform, regularly varying (or slowly varying) entropy bounds. We
need good bounds for $\psi_n(\rho_j)$ in order to apply the inequalities in Section 2.
In this section we do this for a large collection of classes of functions that
includes the ubiquitous VC classes. In the theorems below, we assume that
the functions in $\mathcal F$ take their values in $[-1,1]$ and they are $P$-centered, and
$F \le 1$ denotes a measurable envelope of $\mathcal F$. For each $n$, we set

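$\|F\|_2 := \|F\|_{L_2(P)}$, $\|F\|_{2,n} := \|F\|_{L_2(P_n)}$, $n \in \mathbb N$,

and let $\sigma$ be a positive number such that

(3.1) $\sup_{f\in\mathcal F} Pf^2 \le \sigma^2 \le \|F\|_2^2$

unless we specify

(3.2) $\sigma^2 = \sup_{f\in\mathcal F} Pf^2$.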

We also let $H : [0,\infty) \mapsto [0,\infty)$ be a regularly varying function of exponent
$0 \le \alpha < 2$, strictly increasing for $x \ge 1/2$ and such that $H(x) = 0$ for $0 \le
x < 1/2$. Given such a function, we let the quantities $C_H$, $D_H$, $A_H$ satisfy



oo > Dh > j u "^^ H{u) du, 



oo > Ah > sup 5 V 1. 

x>2 

Finally, if $(T,d)$ is a pseudometric space and $\varepsilon > 0$, then $N(T,d,\varepsilon)$ denotes
the $\varepsilon$ covering number of $(T,d)$ (the smallest number of open balls of radius
at most $\varepsilon$ needed to cover $T$) and $D(T,d,\varepsilon)$ denotes the $\varepsilon$ packing number
(the largest possible number of elements in $T$ separated from each other by
at least a distance of $\varepsilon$), and recall the elementary inequality

$N(T,d,\varepsilon) \le D(T,d,\varepsilon) \le N(T,d,\varepsilon/2)$

for all $\varepsilon > 0$, that we will use without further mention.
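The sandwich between covering and packing numbers can be checked numerically on a finite set. The following is only an illustrative sketch: the sample `T`, the metric `d`, the value of `eps` and all function names are assumptions chosen for the example, not objects from the text. A maximal $\varepsilon$-separated set is built greedily; it is at once a packing (so its size is at most $D$) and the center set of a cover by open $\varepsilon$-balls (so its size is at least $N$).

```python
import random

def greedy_separated_set(points, eps, dist):
    """Greedily build a maximal eps-separated subset of `points`.

    Selected centers are pairwise at distance >= eps, and every point of
    `points` lies at distance < eps from some selected center.
    """
    centers = []
    for p in points:
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

def is_open_cover(points, centers, eps, dist):
    """Check that open balls of radius eps around `centers` cover `points`."""
    return all(any(dist(p, c) < eps for c in centers) for p in points)

def is_separated(centers, eps, dist):
    """Check that distinct centers are pairwise at distance >= eps."""
    return all(dist(a, b) >= eps
               for i, a in enumerate(centers) for b in centers[i + 1:])

# Hypothetical finite "T": 500 points in [0, 1] with the usual distance.
rng = random.Random(0)
T = [rng.random() for _ in range(500)]
d = lambda x, y: abs(x - y)
eps = 0.05
S = greedy_separated_set(T, eps, d)
```

Since each open ball of radius $\varepsilon/2$ contains at most one point of an $\varepsilon$-separated set, the size of `S` cannot exceed the size of any cover at radius `eps / 2`, which is the second half of the inequality.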



Theorem 3.1. If

(3.3) $\log N(\mathcal F, L_2(P_n), \tau) \le H\Big(\dfrac{\|F\|_{2,n}}{\tau}\Big)$

for all $\tau > 0$, $n \in \mathbb N$ and $\omega \in \Omega$, then there is a positive constant $C(H)$ that
depends only on $A_H$, $C_H$ and $D_H$, such that



1 



-E 



i=l 



(3.4) 



C{H) 

< [V^llFlls] A 



\/na\ H 



2\\F\\ 



a 



V 



V a IUOCh J 



The bound 
1 



(3.5) 



C{H) 



E 



i=l 



< 



2\\F\\ 



V 



H 



2\\F\ 



a 



which also holds in general, is useful when na^ > c > 0. Finally, for any 
c> 0, if 



(3.6) 

then 

(3.7) 



na > cH I 1 , 



E 



i=l 



<K{H, c)^aJH 



T 



2\\F\\ 



a 



for a constant K{H,c) that depends only on H and c. 



Proof. We delete the subscript from norms when no confusion may
arise. By standard symmetrization, $E\|\sum_{i=1}^n f(X_i)\| \le 2E\|\sum_{i=1}^n \varepsilon_i f(X_i)\|$.
Set

$\sigma_n^2 := \|P_n f^2\|_{\mathcal F}$.
The usual entropy bound for sub-Gaussian processes (for the constant, we
combine the last display on page 320 of [29] with Theorem 11.17 and first
display on page 322 of [29]) gives



E 



2a„ 



(3.8) 



i=l 



<60E 



< 120E 



< 120E 



+ 120^ 



logN{J',L2{Pn),T)dT 



IH 



2,n 



IH 



2IIFII 



dr/(||F||2,n<2||F 



\F\ 



2,n 



dT/(||F||2,n>2||F 



Now, J^" ^H{\\F\\2,n/r) dr < ||F||2,„ ^H{l/u)du < DnWFh^n, and there- 
fore. Holder's inequality followed by Bernstein's gives 



E 



(3.9) 



\F\ 



2,n 



(ir/(||F||2,„>2||F||2: 



<L>i^||F||2exp --n||F||^ < 



9 



Dh 
1^ 



for the second summand in (3.8). To bound the first summand, note that by
concavity of $\int_0^x h(t)\,dt$ when $h \searrow$, and by the properties of $H$, if $\|\sigma_n\|_2 \le B$,



E 



2\\F\ 



(3.10) 



< E 



< 



T 

CTnA2|jF||2 



rfr/(||F||2,„<2||F||2) 



IH 







|(j„||2A2|iF||2 



2IIFII 



dr 



2\\F\ 



dr 



< 



BA2||F||2 



2\\F\\ 



dr 



<ChBJH 



2||F||2 
5 A2||F||2 
Taking $B = \|F\|_2$ in (3.10), inequalities (3.8)-(3.10) give the bound
E\\ YJi=i^if{Xi)\\ < m{DH + 2^/HtT}CH)V^\\F\\2, hence the first term in 
the minimum at the right-hand side of (3.4). This bound is accurate only
if $\|F\|_2$ is comparable to $\sigma$, and useful only if $\sqrt{n}\|F\|_2$ is not too large.
Otherwise, to get the remainder of bound (3.4), we use (2.2) to the effect
that



I \\2 ^ 2 . 



E 



n 



and take the right-hand side term of this inequality as B in (3.10). Inequality 
(3.10) with this B then gives, using (3.1), 



E 



\F\ 



2,n 



(ir/(||F||2,„>2||F||2) 



2IIFII 



a 



+ 



i=l 



IH 



2\\F\\ 



a 



2\\F\\ 



Combining this inequality with inequalities (3.8) and (3.9), and setting
$E := E\|\sum_{i=1}^n \varepsilon_i f(X_i)\|$, it follows that



either E < 120L>// or E <360CHVn(7\ H 



2\\F\ 



or E<8- 120^Cl 



H 



H 



2\\F\\ 



a 



A H 



F\ 



\ ^2E 



viffi: 



Now the result follows using elementary algebra, upon observing that if 
^{x) := x/i?(l/Vx), < X < 1, then ^-^[u) < u{H(l/^) VI), < m < 
l/H{l). □ 



It is easy to keep track of the constants in the previous proof, but not 
necessarily useful. 

Several remarks on the previous theorem are in order here. 

(1) One may ask for similar inequalities for higher moments. In fact,
Theorem 3.1 together with Proposition 3.1 in [23] yields that there exists a
constant $C(H)$ that depends on $H$ only through $A_H$, $C_H$ and $D_H$, such that,
under the same assumptions as in Theorem 3.1, for all $n \in \mathbb N$ and $p \ge 1$,



E 



1=1 



(3.11) < C^{H) 



^'^^/^f j ^^l^^ 1440C^ 



(2) In (3.3) we could replace $H(\|F\|_{2,n}/\tau)$ by slightly more complicated
expressions and the proof of the theorem would still yield sensible bounds; for
instance, we show in Example 3.7 that for VC-major classes the right-hand
side of (3.3) is of the form $H_1(\|F\|_{2,n}/\tau) + H_2(\|F\|_{2,n}/\tau)\log(A/\|F\|_{2,n})$, with
$H_1$ and $H_2$ regularly varying of exponent 1, and with the whole expression
monotone in $\|F\|_{2,n}$, and in this case the proof of Theorem 3.1 works with
only formal changes.

(3) Note that it is the regular variation of H that allows us to replace the 
typical entropy integrals by actual entropies in Theorem 3.1. This is signifi- 
cant because it turns out that a partial Sudakov inequality for Rademacher 
processes due to Talagrand ([29], page 114, Proposition 4.13) allows us to 
obtain a lower bound for expectations that in some cases is of the same 
order as the upper bound (3.4). Here is this inequality applied to classes of 
functions whose absolute values are uniformly bounded by 1: 

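Lemma 3.2 (Talagrand). There exists a universal constant $L$ such that

(3.12) $E_\varepsilon\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\| \ge \dfrac{\sqrt{n}\,\delta}{L}\sqrt{\log N(\mathcal F, L_2(P_n), \delta)}$,

whenever

(3.13) $E_\varepsilon\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\| \le \dfrac{n\delta^2}{L}$.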

We will apply this result with $\delta = \sigma/8$. In what follows, the function
$H$ satisfies the same conditions as in Theorem 3.1. Also, we set $\|F\|_{2,n} :=
\|F\|_{L_2(P_n)}$.

Definition 3.3. A class of functions $\mathcal F$ that satisfies condition (3.3),
that is,

$\log N(\mathcal F, L_2(P_n), \tau) \le H\Big(\dfrac{\|F\|_{2,n}}{\tau}\Big)$

for all $\tau > 0$, $n \in \mathbb N$ and $\omega \in \Omega$, is full for $H$ and $P$ if there exists $c > 0$ such
that



(3.14) 



log N{J^,L2{P), a/ 2) >cH 



a 



Theorem 3.4. Let $\mathcal F$ satisfy condition (3.3). Assume



(3.15) no-^> 2500 V 



IQAh 



na^ > [(672L^) V l]1920^CfjH 



6IIFI 



where L is the constant in Lemma 3.2. Then, 

(3.16) $E\Big\|\sum_{i=1}^n f(X_i)\Big\| \ge \dfrac{\sqrt{n}\,\sigma}{32L}\sqrt{\log N(\mathcal F, L_2(P), \sigma/2)}$.



In particular, if a class $\mathcal F$ satisfies the entropy bound (3.3) and is full, then,
for all $n$ for which conditions (3.15) hold,



16L 



\F\\ 



< E 



(3.17) 



i=l 



< 1920C/fV^ aslH 



2\\F\\ 



a 



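Proof. By Talagrand's lemma above,

(3.18) $E_\varepsilon\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\| \ge \dfrac{\sqrt{n}\,\sigma}{8L}\sqrt{\log N(\mathcal F, L_2(P_n), \sigma/8)}$,

whenever

(3.19) $E_\varepsilon\Big\|\sum_{i=1}^n \varepsilon_i f(X_i)\Big\| \le \dfrac{n\sigma^2}{64L}$.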

Now we will lowerbound the right-hand side of (3.18) and upperbound the
left-hand side of (3.19) with large probability. We start with the right-hand
side of (3.18). Let $D := D(\mathcal F, L_2(P), \sigma/2)$. By the law of large numbers
applied to $D$ functions in $\mathcal F$ and to $F$, for all $\varepsilon > 0$ there exist $n$ and $\omega$ such
that

$D(\mathcal F, L_2(P), \sigma/2) \le D(\mathcal F, L_2(P_n(\omega)), (1-\varepsilon)\sigma/2) \le N(\mathcal F, L_2(P_n(\omega)), (1-\varepsilon)\sigma/4)$

and

$\|F\|_{L_2(P_n(\omega))} \le (1+\varepsilon)\|F\|_2$,

so that, taking, for example, $\varepsilon = 1/5$, we obtain by (3.3) that

(3.20) $D(\mathcal F, L_2(P), \sigma/2) \le e^{H(6\|F\|_2/\sigma)}$.

Let $f_1, \ldots, f_D$ be a maximal subset of $\mathcal F$ satisfying $P(f_i - f_j)^2 \ge \sigma^2/4$ for all
$i \ne j$. By Bernstein's inequality (e.g., in the form given in [8]), since
moreover $P(f_i - f_j)^4 \le 4P(f_i - f_j)^2 \le 16\sigma^2$,



Pr|^^max^(^nP(/i - /,f - ^^(/^ - fjf{Xk)j > ft + V32tna^j < Z^V*. 
Hence, taking t = Sna'^ and using P{fi — /j)^ > and (3.20), 



Pr| min {f,- fA\Xk) < ( - -- 

which for 6 = 1/(32 • 8'^) gives 

Pr( min P„(/, - /,.)2 < ^1 < e^(6||F|!2/a)g-n.V(32.83)_ 

This implies that the event $A_1$ on which

(3.21) $N(\mathcal F, L_2(P_n), \sigma/8) \ge D(\mathcal F, L_2(P_n), \sigma/4) \ge D = D(\mathcal F, L_2(P), \sigma/2) \ge N(\mathcal F, L_2(P), \sigma/2)$

has probability

(3.22) Pr(Ai) > 1 - e^(6|l^ll2/-)-n-V(32.83)_ 

Under the present hypotheses, (3.7) holds; actually, the proof of Theorem 3.1 
gives, before desymmetrizing, that 



E 



i=l 



2IIFII 



in particular giving the right-hand side inequality in (3.17). Therefore, using 
(2.2) and (3.15), we have 



E 



J2{f{x,)-pf) 



i=l 



<na^ + E 



i=l 



< 2na^ + 8E 



1 

n 



i=l 



< 2na^ + Ax 1920ChV^ cr\ H 



2\\F\\ 



a 



< 6na . 
Hence, Bousquet's version of Talagrand's inequality ([10], Theorem 7.3; see
also [32]) gives



Pr^ 



i=l 



> 6na^ + V2Qtna^ + 1/3 i < e"*. 



which, taking t = 26na'^, becomes 



Pr^ 



i=l 



[Here we could have used Talagrand's inequality (2.1) instead of Bousquet's, 
but the resulting bound would have been less neat.] So, the event A2 where 



(3.23) 



i=l 



has probability 

(3.24) $\Pr(A_2) \ge 1 - e^{-26n\sigma^2}$.

Also, by Bernstein's inequality, as mentioned above, the event

(3.25) $A_3 = \{\|F\|_{2,n} \le 2\|F\|_2\}$

has probability

(3.26) Pr(^3)>l-exp{-|n||F||i}. 
Now, on $A_2 \cap A_3$, the usual entropy bound and (3.15) give



\ n ^ 



i=l 



< 120 / \ H 



\F\ 



2,n 



dT 



< 120 



42a- 



2\\F\ 



dT 



(3.27) 



< 60V42 



2a 







2IIFI 



dr 



< 120V^2Ch(T]^H 
,2 



\F\\ 



a 



< 



n a 



64L 



It follows from (3.18)-(3.27) that 
i=l T 



(3.28) E 



8L 



log N{T, L2{P),a/2) Ft{Ai n yls n A3 
and that

Pr(^l n ^2 n .43) > 1 - e^(6|lF|U/a)-naV(32-83) _ ^-2&na^ _ 

This last probability is larger than 1/2 by the inequalities in (3.15). So,
integrating in (3.28) and desymmetrizing, we obtain inequality (3.16). The
left-hand side of (3.17) now follows from (3.16) and Definition 3.3, proving
the theorem. □



Theorem 3.1 recovers and improves on inequalities that go back to 
Talagrand [42] (see also follow-up work in [20, 21, 35] and, more recently, 
[22], where only the first and last of these four references use the L2 norm 
of the envelope in their inequalities). Theorem 3.4 shows that, at least for 
large n, these inequalities are sharp up to constants. 

Example 3.5. Suppose that $\mathcal F$ is VC-subgraph, that is,

$\{\{(x,t) : 0 \le t \le f(x)\} : f \in \mathcal F\} \cup \{\{(x,t) : 0 \ge t \ge f(x)\} : f \in \mathcal F\}$

is a VC class of sets. Or, more generally, suppose $\mathcal F$ is VC type, that is,
there exist $A \ge e$ and $v \ge 1$ such that

$N(\mathcal F, L_2(Q), \tau) \le \Big(\dfrac{A\|F\|_{L_2(Q)}}{\tau}\Big)^v$

for all $0 < \tau \le 2\|F\|_{L_2(Q)}$ and all probability measures $Q$, where $F := \sup\{|f| :
f \in \mathcal F\}$. In this case $H(u) = v\log(Au)$ is slowly varying ($\alpha = 0$) and we can
take $C_H = 2$, $D_H = 2A\sqrt{v}/e$ and $A_H = A$.
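To see what "slowly varying" means for the VC-type entropy function, one can evaluate the ratio $H(\lambda u)/H(u)$ for large $u$; for a regularly varying function of exponent $\alpha$ this ratio tends to $\lambda^\alpha$, hence to 1 when $\alpha = 0$. The sketch below is illustrative only: the default values `A = e` and `v = 1` and the function names are assumptions made for the example.

```python
import math

def H(u, A=math.e, v=1.0):
    """H(u) = v * log(A * u) for u >= 1/2, and 0 for 0 <= u < 1/2."""
    return v * math.log(A * u) if u >= 0.5 else 0.0

def variation_ratio(lam, u, A=math.e, v=1.0):
    """Ratio H(lam * u) / H(u), which should approach lam**0 = 1 as u grows."""
    return H(lam * u, A, v) / H(u, A, v)

# The ratio drifts toward 1, but only at a logarithmic rate.
ratios = [variation_ratio(2.0, u) for u in (1e3, 1e6, 1e100)]
```

The convergence is slow (the excess over 1 is $\log\lambda/\log(Au)$), which is the usual behavior for logarithmic entropy functions.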

Since subsets of a VC-subgraph class are also VC-subgraph, this can be
applied to the class $\mathcal F(tq^{-1},t] = \{f \in \mathcal F : tq^{-1} < \sigma_P f \le t\}$ with its measurable
envelope, say, $F_t$. Define

$g_q(t) := \dfrac{A\|F_t\|_2}{t}$, $0 < t \le 1$,

where $\|F\|_2 := \|F\|_{L_2(P)}$. Then the function $\log g_q(t)$ plays the crucial role
in the expectation bound for the class $\mathcal F(tq^{-1},t]$ (which is needed in the
inequalities of Section 2 for ratio type suprema). In Sections 4 and 5 below,
this function will be involved in conditions for limit theorems about ratio
type empirical processes on VC-subgraph classes. It turns out that if $\mathcal F$ is
a class of indicator functions (i.e., we are dealing with a VC class of sets)
such that $PC \le 1/2$ for all $I_C \in \mathcal F$, then

A'\g,{t)f/^ = P'V{C--Ic&:F,tq-^<apiIc)<t}] 

is comparable to (in fact, possibly smaller than) Alexander's [2] capacity
function $g(t^2)$.
Example 3.6. The scope of Theorem 3.1 is much larger than just VC
classes. For instance, let $\mathcal F = \{f_n := I_{A(n)}/\log(n \vee e),\ n \in \mathbb N\}$, with $A(n) \subset
[0,1]$ independent for Lebesgue measure and with Lebesgue measure equal
to 1/2 (introduced in [16], proof of Theorem 2.1), and let $P$ be Lebesgue
measure on $[0,1]$. Then, $F := 1$ and $\sigma = 1/2$. Also, considering the $L_2(P_n)$
balls centered on the first $m$ functions, with $m$ of the order of $e^{1/\varepsilon}$, it is
easy to see that $\log N(\mathcal F, L_2(P_n), \varepsilon)$ is of the order of a constant times $1/\varepsilon$,
independently of $n$. Then, Theorem 3.1 gives that $E\|\sum_{i=1}^n f(X_i)\| \le c\sqrt{n}$
for some fixed $c < \infty$, and this is best possible up to constants since $\mathcal F$ is
$P$-Donsker.

Other classes whose covering numbers are not polynomial include VC-major
and VC-hull (see [16] for definitions). We mention the definition of VC-major,
that we use below: $\mathcal F$ is VC-major if the collection of sets $\{\{s \in S :
f(s) > t\} : t \in \mathbb R, f \in \mathcal F\}$ is VC. The following bound on the entropy of such
a class is, most likely, new. Note that, as in the case of VC-subgraph classes,
it also involves the envelope of the class.

Example 3.7. Suppose that $\mathcal F$ is a measurable VC-major class of $P$-centered
functions whose absolute values are bounded by 1. Our goal will
be to show that there exists $A > 0$ such that for all probability measures $Q$
and all $0 < \tau \le 2\|F\|_{L_2(Q)}$

logiV(^,L.(Q),r) <^M3l^i,,^^M^ log(i). 

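To this end, take $t_j := (1+\tau)^{-j}$, $j \ge 0$, and let $m(\tau)$ be the smallest $j$ such
that $t_j \le \tau\|F\|_{L_2(Q)}$. Clearly,

$m(\tau) \asymp \dfrac{\log(1/(\tau\|F\|_{L_2(Q)}))}{\tau}$.

For $f \in \mathcal F$, define

$f_\tau := \sum_{j=1}^{m(\tau)} t_j I(t_j < f \le t_{j-1})$.

If $t_j < f(x) \le t_{j-1}$ for $j \le m(\tau)$, then

$0 \le f(x) - f_\tau(x) \le t_{j-1} - t_j = \tau t_j \le \tau f(x) \le \tau F(x)$.

Hence, as soon as $f(x) > \tau\|F\|_{L_2(Q)}$,

$0 \le f(x) - f_\tau(x) \le \tau F(x)$,

otherwise

$0 \le f(x) - f_\tau(x) \le f(x) \le \tau\|F\|_{L_2(Q)}$.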
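The geometric discretization used here can be checked on a grid of function values. The sketch below is hedged: it treats a single value $f(x) \in [0,1]$, the numbers `tau = 0.1` and `F_norm = 0.5` (standing in for $\|F\|_{L_2(Q)}$) are hypothetical, and `discretize` is a name introduced only for this illustration.

```python
def discretize(f_value, tau, threshold):
    """Return f_tau(x) for a value f(x) in [0, 1].

    Grid: t_j = (1 + tau)**(-j), j >= 0; m is the smallest j with
    t_j <= threshold, where threshold plays the role of tau * ||F||.
    f_tau(x) = t_j if t_j < f(x) <= t_{j-1} for some 1 <= j <= m, else 0.
    """
    m = 0
    while (1.0 + tau) ** (-m) > threshold:
        m += 1
    for j in range(1, m + 1):
        t_j = (1.0 + tau) ** (-j)
        t_prev = (1.0 + tau) ** (-(j - 1))
        if t_j < f_value <= t_prev:
            return t_j
    return 0.0

# Hypothetical parameters for the check.
tau, F_norm = 0.1, 0.5
threshold = tau * F_norm
grid = [k / 1000 for k in range(1001)]
residuals = [(f, f - discretize(f, tau, threshold)) for f in grid]
```

On this grid the residual $f - f_\tau$ is nonnegative, bounded by $\tau f$ above the threshold and by the threshold itself below it, matching the two cases in the text.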
This implies that

$\|f - f_\tau\|_{L_2(Q)}^2 \le \tau^2\|F\|_{L_2(Q)}^2 + \tau^2\|F\|_{L_2(Q)}^2 = 2\tau^2\|F\|_{L_2(Q)}^2$.

Denote $\mathcal F_\tau := \{f_\tau : f \in \mathcal F\}$. Since



m{T) 

{ix,t):frix)>t}= U {x:/(x)>t,_i} x 
i=i 

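and $\mathcal F$ is a VC-major class, the class $\mathcal F_\tau$ is VC-subgraph with VC dimension
bounded by $Vm(\tau)$ for some $V > 0$. Clearly, $F$ is an envelope of $\mathcal F_\tau$ (since
$0 \le f_\tau \le f$ for all $f \in \mathcal F$). Therefore (see, e.g., [47], Theorem 2.6.7), for $\tau > 0$,

$N(\mathcal F_\tau; L_2(Q); \tau\|F\|_{L_2(Q)}) \le \Big(\dfrac{A}{\tau}\Big)^{Vm(\tau)}$,

which implies

$N(\mathcal F; L_2(Q); 3\tau\|F\|_{L_2(Q)}) \le \Big(\dfrac{A}{\tau}\Big)^{Vm(\tau)}$.

Taking into account the bound on $m(\tau)$ and changing variables $\tau\|F\|_{L_2(Q)} \mapsto
\tau$, the result follows. Note that the bound can be also written as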



log N{T,L2{Q),t) 



< 



A\\F\ 



L2{Q) 



+ log 



A\\F\ 



L2{Q) 



T 

A\\F\ 



T 



log 



A\\F\ 



L2{Q) 



■.= H(\\F\ 



lL2(Q)''^i' 

which can be used in the proof of Theorem 3.1 (with some modifications) 
to give 

n 



E 



i=l 



< {M\F\\l2{P){^ + Vlog(^ll^llL.{P))"' )) 



A [{y/TiaJH{\\F\\L2{P),<y) ) V i/(||F||i,(P),a) V Vl^]- 



Example 3.8. As a more specific example, consider the class $\mathcal F$ of
nondecreasing functions from $[0,1]$ into itself. Obviously, it is a VC-major class.
Let $P$ be a nonatomic probability measure on $[0,1]$ and let $G$ be its
distribution function. Denote

$\mathcal F_\delta := \{f \in \mathcal F : Pf^2 \le \delta^2\}$,

which, of course, is also a VC-major class. An easy computation shows that
the envelope of $\mathcal F_\delta$ is

$F_\delta(x) := \sup_{f\in\mathcal F_\delta} f(x) = \dfrac{\delta}{\sqrt{P[x,1]}} \wedge 1$

(if $x$ is such that $P[x,1] \ge \delta^2$, then the supremum in the definition is attained
at the function $f_x$ such that $f_x(y) = 0$ for $y < x$ and $f_x(y) = \delta/\sqrt{P[x,1]}$ for $y \ge x$;
otherwise, the supremum is equal to 1). Let $x_\delta$ be such that

$P[x_\delta,1] = 1 - G(x_\delta) = \delta^2$.

Then

^« P{dx) 



P[x,l] 

/ 7^+^ =^ / — + logT2• 
l - G{x) J52 y o-^ 







Hence 



,5 V ^ 

and using this together with our bound on the entropy of VC-major classes 
in Theorem 3.1 yields, by a simple computation, 

E\\Pn - p\\t, < ^(iogri)3/4(iogiogri)i/2 



V -(iogr^)3/2(iogiogri) V 

n n 
So, in spite of the fact that the entropy of the class of monotone functions
is relatively large, the supremum of the empirical process over the class $\mathcal F_\delta$
of "small" monotone functions is of about the same size as for VC-classes of
sets due to the small size of localized envelopes.
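The size of the localized envelope in Example 3.8 can be checked numerically in a special case. The sketch below assumes, purely for illustration, that $P$ is the uniform distribution on $[0,1]$, so that $P[x,1] = 1 - x$, and takes the envelope formula $\delta/\sqrt{P[x,1]} \wedge 1$ at face value. Under that assumption $\int F_\delta^2\,dP = \delta^2(1 + \log\delta^{-2})$, so $\|F_\delta\|_2/\delta$ grows only like $\sqrt{\log(1/\delta)}$. The function names are hypothetical.

```python
import math

def envelope_sq_norm(delta, n_cells=200_000):
    """Midpoint-rule approximation of int_0^1 min(delta^2 / (1 - x), 1) dx,

    i.e. the squared L2(P) norm of the envelope when P is uniform on [0, 1].
    """
    h = 1.0 / n_cells
    total = 0.0
    for k in range(n_cells):
        x = (k + 0.5) * h
        total += min(delta * delta / (1.0 - x), 1.0)
    return total * h

def closed_form(delta):
    """delta^2 * (1 + log(1 / delta^2)), the exact value in the uniform case."""
    return delta * delta * (1.0 + math.log(1.0 / (delta * delta)))
```

For instance, `envelope_sq_norm(0.1)` and `closed_form(0.1)` agree to several decimal places, and both are far smaller than the squared norm 1 of the envelope of the whole class.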

4. Ratio limit theorems I: rates when $\phi(t) = t^\alpha$. In this section we
will derive limit theorems a.s. and in probability for general ratio empirical
processes, as direct applications of the bounds in Section 2, and we will
specialize these to different types of classes of functions, particularly, VC
classes, for which we will use the results from Section 3.

4.1. The case $\phi(t) = t^2$. We begin with a law of large numbers already
in [22], Theorem 6. In this case we take $\sigma_P^2 f = Pf$ (recall that the class $\mathcal F$
consists of functions taking values in $[0,1]$). We set

Hn,q ■ — l-'n,q,t^ — aLip 2 ; 

l<j<'n Pj 
where pj = q>rn, l<j< with = loglogg(5q'/r„).

Theorem 4.1. Let < 6 <l and Vn \ 0. Let I < qn<2 be a nonin- 
creasing sequence such that log(^„ — = o{nr'^). If nr"^ — > oo, then the 
condition 

(4.1) /3n,,„-0 
is necessary and sufficient for 

(4.2) sup 

rl<Pf<S 

in probability. Moreover, i/ 1 > (?n — 1 > (log n)~^ for some 6 > 0, nr^/ log log n 
oo and (3n,qn/ V^\^> then condition (^.1) is necessary and sufficient for 
the limit in ( 4.2j to hold a.s. 



Pnf 



Pf 



Proof. The "in probability" part of the theorem follows directly from 
Proposition 2.8 with s = Sn ^ oo such that Sn/inr'^) and s„/log(g^ — 
— > oo. The "a.s." part follows from Lemma 2.3 together with Proposi- 
tion 2.8 with s = s„ = (2 + 6)Kqn log log n. □ 

The condition $nr_n^2 \to \infty$ is natural in this problem: it certainly is for
$\mathcal F = \{I_{[0,t]} : 0 < t \le 1/2\}$, since $\sum_{i=1}^n I_{[0,1/n]}(X_i) - 1 \to_d N - 1$, where $N$ is
Poisson 1.
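The Poisson limit behind this remark can be made concrete. Assuming for illustration that the $X_i$ are uniform on $[0,1]$, the count $\sum_{i=1}^n I_{[0,1/n]}(X_i)$ is Binomial$(n, 1/n)$, and Le Cam's inequality bounds its total variation distance to Poisson(1) by $1/n$. The following sketch computes that distance exactly (up to a negligible truncation of the tails); the function names and the truncation level `k_max` are choices made for this example.

```python
import math

def binomial_pmf(n, p, k_max):
    """P(Bin(n, p) = k) for k = 0..k_max, via the ratio recursion."""
    pmf = [(1.0 - p) ** n]
    for k in range(k_max):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * p / (1.0 - p))
    return pmf

def poisson_pmf(lam, k_max):
    """P(Poisson(lam) = k) for k = 0..k_max."""
    pmf = [math.exp(-lam)]
    for k in range(k_max):
        pmf.append(pmf[-1] * lam / (k + 1))
    return pmf

def tv_distance(n, k_max=60):
    """Total variation distance between Bin(n, 1/n) and Poisson(1).

    Both laws put negligible mass beyond k_max = 60, so the sum is truncated.
    """
    b = binomial_pmf(n, 1.0 / n, min(n, k_max))
    q = poisson_pmf(1.0, k_max)
    b += [0.0] * (len(q) - len(b))  # Bin(n, p) has no mass above n
    return 0.5 * sum(abs(x - y) for x, y in zip(b, q))
```

In particular the count is 0 with probability close to $e^{-1}$, so at the scale $Pf = 1/n$ the ratio $P_nf/Pf$ stays random in the limit instead of concentrating at 1.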

For a specialization of Theorem 4.1 to VC type classes of functions, ob- 
tained by replacing ipniPj) in the definition of En,q by its estimate from 
Section 3, see Theorem 10 in [22], which recovers classical results and com- 
pares to the sufficiency part of Theorem 5.1 of [2] if we restrict to VC classes 
of sets. 

Regarding rates, Proposition 2.8 also gives immediately the following: 

Theorem 4.2. (a) For qn G (1,2] nonincreasing, and with 7„ := nr^/ 
log((?n — 1)~^) the sequence 



1 



(4.3) 



A 



In 



1 + Pn,q„ 



A7„log(7„(l+A 



n,q„ 



V2' 



sup 

rl<Pf<S 



Pnf 



Pf 



n £ N, 



is stochastically bounded. 

(b) With qn as in part (a), if 

(4.4) sup^„,g„<oo and ^/y^(3n,q„ 



■ oo. 
then
(4.5) 



lim — sup 

rl<Pf<S 



Pnf 



Pf 



m pr. 



Lemma 2.3 and Proposition 2.8 also give almost sure counterparts of 
Theorem 4.2, that we leave to the reader. 

Here is an example showing that normings other than $\beta_n := \beta_{n,q_n}$ do occur
in (4.3).

Example 4.3. Our object here is to exhibit an example of a class of 
functions that satisfies 



(4.6) nr„ — > oo and ynrnfin 

for which the sequence 
1 



(4.7) 



-T sup 



Pf 



1 



n G N, 



is not stochastically bounded, but the sequence 

Pnf 



(4.8) 



Hn sup 



Pf 



1 



n G N, 



is. Let Ekj be independent random variables with 



Pr{efcj = 1} 



1 



1-Pr{efc, = 0}, j,A:GN, 



and 



:k = l,2,.. 



,(logj)2 

where $\log x := \log(x \vee e)$ [and below, $\log\log x = \log\log(x \vee e^e)$]. The variables
$X_k$ are i.i.d. $c_0$-valued r.v.'s.
Let

$\mathcal F = \{f_j(x) = x_j : j \in \mathbb N\}$,

where $x_j$ is the $j$th coordinate of $x \in c_0$, so that $Pf_j = (j\log j)^{-2}$ and
1 " 

(Pn - pm = ^r-w -r') 



and 



Pnfj 
Pfj 



a2 n 



n 



k=l 
Set

loen 



5=1 



n ^' 



Claim 1 . There is a permissible Qn such that 

yioglogn 

Proof. We can take 

Pn ■■ = sup 

u>r„ U 

\Pnf-Pf\ 

~ sup h sup 5 

u>r„ 



where 2 > g„ \ 1 is such that 

"l 



log ^ = o(nr^) = o((logn)^). 



In fact we take 



Hn — -L ~r ■ 



n 



In order to upperbound /?„ we note that the number of integers j such 
that < Pf < u^Qn, that is, such that {uqn)~^ < jlogj < , u > rn, is 
dominated by 

1 1 - 1 , - 1 . , 
= < < log n 

U uqn uqn rnQn 

because if F{x) =xlogx, a; > 1, then < 1. Moreover, the smahest 

j in this range, call it j(tt), satisfies 



1 1 \/n 



uqnlog{uqn) ^ r.„g„log(r.„g„) ^ 2(logn)2' 

Bernstein's inequality and Lemma 2.2.10 in [47] (a convexity argument 
due to Pisier) then give that, for some universal constant K, 

\Pnf-PfX 



Pn < sup E sup „ 

u>r„ \ fer 

< ^ 

~ u>F,^ nn2(log(l/(ugnlog(ng„)-i)))2 
■ log(l + log 77,)



+ y/nuqnlog{uqn) Vlog(l+log 



Since this bound is the sup of a decreasing function of u, we have 



5K 



< 



6K-yiog log n 



^log(l + log77) + (log?l)^A/log(l + log 77) 



(log 77)^ ' 

at least for all n large enough. Claim 1 is proved. □ 

It follows from Claim 1 that: 

(i) $\beta_n \to 0$, and

(ii) V^rA<'-^^^^^^0, 

in particular, (4.6) holds. From (i) and Theorem 4.1, we know that 

Pnf 



sup 

Pf>rl 



Pf 



in fact, from Theorem 4.2, 

(log 77)^/^ sup 

Pf>rl 



1 



Pnf 



in pr.; 



Pf 



77 G N, 



is stochastically bounded [note that -y/Tn — y/^'^n 

I v/log(g„ - is of the 
order of (log 77)^/^], so that (4.8) holds. Next we are going to show the fol- 
lowing: 



Claim 2. For any A„ — > co, the sequence 



A„ (log 77)3/2 g^jp 

P/>r2 



Pnf 



Pf 



71 G N, 



converges to infinity in probability, hence so does the sequence 



1 

— sup 

Pn Pf>rl 



Pnf 



Pf 



?^ G N, 



by (ii) above, = o(-v/IogTogr7/(log77)2). 
Proof. Since $j \le \sqrt{n}/(\log n)^2$ implies $Pf_j \ge r_n^2$, it follows that



(4.9) 



sup 



Pnf 



Pf 



> max — 

i<v^/(logn)2 n 



> 



k=l 



max 



4(logn)^ V"/2{logn)2<i<v^/(logn)2 



k=l 



Now we estimate this supremum. First we note that, by direct computation
(or, e.g., by Hoffmann-Jørgensen's inequality), if $\xi$ is Bin$(n,p)$ and $np(1 -
p) \ge 1$, then there is a universal constant $c < \infty$ such that $E|\xi - np|^3 \le
c(np)^{3/2}$, which, by Berry-Esseen, implies



(4.10) 



Fi{^-np<tJnp{l-p)} -Fr{g<t}\ < 



C 



n 



for another universal constant C, where g is standard normal. Hence, for 
any A = An>0, 



Pr^ 



max 

x/n/2(logn)2<j<Vn/(logn)2 



k=l 



>A 



(4.11) 



fc=i 



= l-n 1-Pr 

>l_J|('l_pJ \g\ > 



A 



>A 



Vnj Vi -r 



+ 



2C 



where the product is over the set of j's such that ■y/n/2(log?i)^ < J < y/n / {\ogn)^ . 
Now, 



A 



< 



2A 



V^J-Vl - J"^ (logn)2 
and, by well-known Gaussian computations, for 2 A > (logn)^, 



Pr||c/| > 
Hence, taking 



A 



(logn) 



Vr J2A/(logn)2 



4^2 



(logn)4 ■ 3 



logn. 



we get 



Pri 1^1 > 



A 



(log 77.)^ 



C ^ 1 2C _ 2c^(logn)- 
with Cn — > oo [cn is of the order of -n}/^ /2{\ognY\. Replacing this estimate
into (4.11), gives 



Pr 



> 1 - 6-^= 

Hence, by (4.9), 



max 

V^/2(logn)2<jr-<v/H/(logn)2 



k=l 



> 



2^3 



(log n) 



5/2 



2cn{logny \ 



2x v^/(2{logn)2) 



Pr<^ (logn)3/2 sup 

Pf>rl 



Pnf 



Pf 



> 



8^3 



proving Claim 2. □ 



4.2. The case $\phi(t) = t$. Here is a result on convergence in probability and
stochastic boundedness of the "normalized" empirical process. It expands
Theorem 1 in [22].

Theorem 4.4. Let $\phi(t) = t$, $\delta \le 1$, $r_n \searrow 0$ and, for $1 < q \le 2$, let $\beta_{n,q}$
denote $\beta_{n,q,t}$. Set

$\xi_n := \sup_{r_n < \sigma_P f \le \delta} \dfrac{|P_nf - Pf|}{\sigma_P f}$.

Then the following statements hold:



(a) If for all q G (l,a) for some a>l, V ^ = o{Pn,g), then 



-^1 > 1 in pr.; also, there are sequences qn\l 



such that 



1 in pr. 



log log 1 An w 1 

n nr„ 



0{(3n,q), then the sequences 



(b) If for some q> I, 
-J^ and are stochastically hounded. 

(c) If for some q> I, = 0{Pn,q) and nrnl3n,q 0, then the 
sequence [nr„ log(l/(nr„/3„ ,j))],^„ is stochastically bounded. 

(d) Let _i_y^!2gl2|l/!ii ^ and ^ = 0{Pn,q) for some q>l; if more- 

logiog" i/r ^"^ stochastically bounded, and oth- 



over, rn > 



then ^ 



erwise, 



n 



loglog„ l/r„ 



A 



nr„log((nr„) V 2) 
loglogg^ l/r„ 



en 



IS. 
(e) Let Y — > oo and nrn(3n,q — > for some q> 1. Then, if



n nrnlog{{nrl) ^ V2)\ 

^ /, , , , = 



loglogl/r„ Vloglogl/r„ 
is stochastically bounded, and otherwise 



nr„log(l/(nr„/?„J) A ,/- — A — nJ ^ — ; 

yioglogl/r„ 7loglogl/r„ / 



IS. 



Proof. (a) In this case condition (2.15) is satisfied and we can apply
inequalities (2.16) and (2.17) with $s = s_n \to \infty$ so that $s_n = O(\log\log(1/r_n))$
and $t = t_n \to \infty$ so that $t_n = o(nr_n\beta_{n,q})$; then the lower bounds $\tau_{n,q,t}$ for

$\sup_{r_n < \sigma_P f \le \delta} \dfrac{|P_nf - Pf|}{\phi_q(\sigma_P f)} - \beta_{n,q}$

in these inequalities are $o(\beta_{n,q})$. Now the result follows because $t \le \phi_q(t) \le qt$
and $\beta_{n,q} \le E_{n,q} \le \beta_{n,q} + C\tau_{n,q}$ [see the proof of Theorem 2.1 for this last
inequality, which holds when the probability in (2.4a) is less than 1].

(b) Follows from similar considerations. 

(c) In this case (2.15) is still satisfied (at least up to a multiplicative
constant whose only effect is in the multiplicative constants in the probability
inequalities). Then, since $n\beta_{n,q}^2 \to \infty$, necessarily $\beta_{n,q} > r_n$ from some $n$ on,
and inequality (2.17) applies. Under the hypotheses of (c), we have (with $\ll$
signifying "little o")



V n " y ^j,^ nrnlog{nrnPn) ^ 

so that t times this last term is the dominant one in Tn^q^ti inequality (2.17). 

(d) In the first case, f5n,q <C ^/n~^ log logn < r„ and (2.16) applies. Oth- 
erwise we must use (2.13) and (2.14); for (2.14), note that if I3n,q > rn, then 



n 



nr„ log(10 V inrn(3n,q) ^) ^ \l nvr. 
(e) Follows using (2.13) and (2.14), from similar easy considerations. □ 

A similar result for the a.s. size of $\xi_n$ can be obtained as well. One applies
the same principles but makes sure that Lemma 2.3 is satisfied. For instance,
direct application of Remark 2.10 and Lemma 2.3 gives that if



^-^\,n'l\n/ and /l"glog V loglogn ^ loglogn ^ ^^^^^^^ 



/n V ^ "^^r, 

for q G (1, a), then 

hmsup < 1 a.s. 

To show that this lim sup is actually equal to 1, apply Remark 2.10 for
n/j = and Borel-Cantelli, as in the second part of the proof of Theorem 2 
in [22]. 

Example 4.5. We now modify Example 4.3 to show that the condition 



/loglogg^K 1 . 

V n nrn 

from part (a) of Theorem 4.4 has some degree of sharpness (the problem
with absolute sharpness is that we are not using the exact value of $\beta_{n,q}$ to
violate the condition, but only an upper estimate), so that our example will
satisfy

$nr_n\beta_{n,q} \le K < \infty$,

but it might well be that actually $nr_n\beta_{n,q} \to 0$, which might be too strong
a violation of the condition.
We consider

\Pnf-Pf\ 

U = sup 

rn<cTpf<5 (ypj 

from Theorem 4.4 with, for example, $\delta = 1/8$, and (see also Section 2)

(3n= sup -e( sup \Pnf-Pf\ 

MG(r„,(5] ''^ \u/qn<(Tpf<u 



with gn \ 1 and r„ \ 0. Take Ekj and as in Example 4.3, and = 
= 2, . . . ) G CO. Since Varp(/j) = jijj^^ - jr^j^^ is of the order of 

l/(jlogj)^ for large j and (3"^ ^ \fn^ taking crp{f) = l/(jlogj)^ will be 
equivalent to taking <Tp(/) = Varp(/). Now define 

V n log log n 

If, for $u \in (r_n, \delta]$, we set

$J_u = \{j : u/q_n < \sigma_P(f_j) \le u\} = \{j : u/q_n < 1/(j\log j) \le u\}$,

we have

j{u) := mm{j:j G J„} > 

and 

Card(J„) < ^ - - < < logn. 

U U Tn 

Using Bernstein and Lemma 2.2.10 in [47], as in Example 4.3, we obtain 
1 



u 



E\ sup \Pnf-Pf\ 



u/qn<<Tpf<U 

K 



< 



nti log (l/(u log u ^)) 
Taking the sup over u G (?"„, (5) we obtain that 



- log log n + y/nu log u \/\og log re 
3 



I log log re 



re 

for some other $K$ and for all $n$ large enough. So, as mentioned, we have

$nr_n\beta_n \le K$.

Now we show that (,n/f3n ^ oo in probability (actually faster than any 
rate An such that , ^" — > 0). First we observe that 

■\y log n I log log n 



Og re ^-|^^2)^ n log logn/ log n<j<^n log logn/ log 



fc=i 



-2^ 



and then we easily check, proceeding as in the previous example, that 



Pr(2K|i>lJ Ul 
I /3„ 4\/loglognJ 

as n — > oo. 

Next, we will consider the case of VC classes of functions, for which we will 
obtain a result that, although it falls short of recovering the full strength of 
Theorem 3.1 in [2] when restricted to classes of sets, still gives best possible 
results up to constants in the classical situation of the uniform empirical 
c.d.f. in several dimensions, indicators of intervals for the uniform, and half- 
spaces for the normal (Corollaries 3.5, 3.7 and 3.9 there). 

We refer to Example 3.5 for the definition of VC-subgraph classes of functions
and recall that, by a result of [40], reproduced in [15], if $\mathcal F$ is a bounded
VC-subgraph, there exist $A \ge e$ and $v \ge 1$ such that, for every subclass
$\mathcal G \subset \mathcal F$, if $G$ is a measurable envelope for $\mathcal G$, then

$N(\mathcal G, L_2(Q), \tau) \le \Big(\dfrac{A\|G\|_{L_2(Q)}}{\tau}\Big)^v$

for all probability measures $Q$ and $0 < \tau \le 2\|G\|_{L_2(Q)}$. Hence, by Theorem 3.1
there exists a constant $1 \le K_1 < \infty$ such that if $\mathcal F$ is such a class
and, moreover, it is suitably measurable and consists of functions taking
values in $[0,1]$, then for all $\mathcal G \subset \mathcal F$,



E 



J2{fix,)-pf) 



1=1 



(4.12) 



V^||G||2 A V^agJlo, 



A\\G\\ 



ag 



Vloe 



M\G\\ 



A V^||G||2 VI 



In particular this applies to the classes $\mathcal F(tq^{-1},t] = \{f \in \mathcal F : tq^{-1} < \sigma_P f \le
t\}$. Letting $F_t$ denote a measurable envelope of $\mathcal F(tq^{-1},t]$, we define (as in
Example 3.5)

(4.13) $g_q(t) := \dfrac{A\|F_t\|_2}{t}$, $0 < t \le 1$,

where $\|F\|_2 := \|F\|_{L_2(P)}$.

Assume that <Jp{f) [which is always > Varp^(/)] satisfies the following 
condition: 



1/2, 



2 < Cap{f) 



with some constant C > 0. 

Recall that, given $f_-, f_+ \in L_2(P)$, the set

$[f_-, f_+] := \{f \in L_2(P) : f_- \le f \le f_+\}$

is called an $L_2(P)$-bracket of size (or of order) $\delta > 0$ iff $\|f_+ - f_-\|_2 \le \delta$. It
will be said that $\mathcal F$ satisfies the local bracketing condition iff there exists a
constant $K > 0$ such that for all $f \in \mathcal F$ and $0 < \delta \le \sigma_P(f)/K$ there exists an
$L_2(P)$-bracket $[f_-, f_+]$ of size $K\delta$ such that

$\{g \in \mathcal F : \|g - f\|_2 \le \delta\} \subset [f_-, f_+]$

[in other words, $L_2(P)$ balls of radius $\delta$ are to be covered by $L_2(P)$-brackets
of size $K\delta$].
Given $0 < r < \delta \le 1$ and $1 < q \le 2$, with $\rho_j = rq^j$, $j = 0, 1, \ldots, l$, $l = \log_q(\delta q/r)$,
we also define

(4.14) $w = w(r) = \max_{0 \le j \le l} (\log\log_q(\delta q/\rho_j)) \vee (\log g_q(\rho_j))$.
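The geometric grid $\rho_j = rq^j$ slices the range $(r, \delta]$ of $\sigma_P f$ into shells $(\rho_{j-1}, \rho_j]$ on which $\sigma_P f$ varies by at most a factor $q$. The sketch below illustrates this slicing; it assumes, as a convention of this example only, that $l$ is the integer part of $\log_q(q\delta/r)$, which guarantees $\rho_l \ge \delta$. The function names and the numerical values are hypothetical.

```python
import math

def peeling_grid(r, delta, q):
    """Return [rho_0, ..., rho_l] with rho_j = r * q**j.

    Assumption of this sketch: l = floor(log_q(q * delta / r)), so that
    rho_l >= delta and the shells (rho_{j-1}, rho_j] cover (r, delta].
    """
    l = math.floor(math.log(q * delta / r) / math.log(q))
    return [r * q ** j for j in range(l + 1)]

def shell_index(sigma, rhos):
    """Return the index j >= 1 with rho_{j-1} < sigma <= rho_j."""
    for j in range(1, len(rhos)):
        if rhos[j - 1] < sigma <= rhos[j]:
            return j
    raise ValueError("sigma outside (rho_0, rho_l]")

# Hypothetical values: r = 0.01, delta = 0.5, q = 1.5 give l = 10.
rhos = peeling_grid(r=0.01, delta=0.5, q=1.5)
```

The number of shells grows only like $\log_q(\delta/r)$, which is why iterated logarithms such as $\log\log_q(\delta q/\rho_j)$ appear in quantities like $w$ when a union bound is taken over shells.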

In the following theorem, we go as far as we can toward extending
Alexander's [2] Theorem 3.1 to classes of functions.

Theorem 4.6. Let $1 < q \le 2$ and $r_n \searrow 0$. Let $\mathcal F$ be a VC-subgraph class
satisfying the local bracketing condition. Then the following hold with $w_n :=
w(r_n)$.

(a) // liminf„ nr^/wn >0 (infinity not excluded), then the sequence 

W \Pnf-Pf\ 

— sup 



Wn fer crpf 

is stochastically bounded. 

(b) // lim„ nr'^/wn = and log = 0{e'^ '^") for some t' > 0, then 

nrnlog{wn/nrl) \Pnf - Pf\ 

sup 

Wn fer crpf 

rn<(Tpf<S 

is stochastically bounded. 

(c) These statements with stochastic boundedness replaced by lim sup
finite a.s. (in fact a constant) also hold with $w_n$ changed to $\bar{w}_n = w_n \vee
(\log\log n)$ in assumptions and conclusions, under the extra hypothesis that
Wn/n? i and — — — 9,.,^- I. 

To prove the theorem, we start by adapting Theorem 2.1' to this situation,
using the bound from Section 3. Let us set up the simplifying notation

$$\mathcal{F}_j := \mathcal{F}(\rho_{j-1}, \rho_j]$$

and denote as $F_j$ a measurable envelope of $\mathcal{F}_j$.

Lemma 4.7. Let $\mathcal{F}$ be a (measurable) VC-subgraph class of functions
taking values in [0, 1], and let $A$, $v$, $K_1$, $0 < r < \delta$, $1 < q \le 2$, $\rho_j$, $\mathcal{F}_j$, $l$,
$g_q$ and $w$ be as above. Assume further that for each $j$ for which $n\rho_j^2 \le w$,
$\mathcal{F}_j = \bigcup_{k=1}^{N_j} \mathcal{F}_{j,k}$ and $\mathcal{F}_{j,k}$ has an envelope $F_{j,k}$ satisfying $\|F_{j,k}\|_2 \le K_2\rho_j$ for
some $K_2 \ge 1$ [i.e., $\mathcal{F}_j$ decomposes into $N_j$ $L_2(P)$-brackets of size of the order
of $\rho_j$]. Then, for $s \le 2K_2 w$,

pJ sup \P^f-Pf\ y qe^sL{7ir^<w) 
I fer apf ~ nr{lV log{e'^s/{K2nr'^))) 

r <cTp/ <(5 
(4.15)



+ 2q{l7Ki + K1K2 + Kl)\ — 

V n 



< Kw exp 



-UKiw 



1 



w 



+ max AT,- )( l + -logg— 2 )/(nr <u;)exp( — 



K 



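Proof. We will apply Theorem 2.1'. Set $J = \{1, \dots, l\}$, $J_1 = \{j \in J :
n\rho_j^2 \le w\}$ and $J_2 = J \setminus J_1$. For $j \in J_1$, we define

$$\psi_{n,j,k} = \frac{1}{n} E\Bigl\|\sum_{i=1}^n (f(X_i) - Pf)\Bigr\|_{\mathcal{F}_{j,k}}$$

and

$$V_{n,j,k} = \frac{1}{n} E\Bigl\|\sum_{i=1}^n (f(X_i) - Pf)^2\Bigr\|_{\mathcal{F}_{j,k}},$$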



which we upperbound by (4.12) and (2.3) as 



^/n ^/n 



and 



For iG J2, by (4.12) 



<KAJ-y — <^i\/- 

V V re npj J y n 



(note that w>l) and, by (2.3), 

K(Pi) < P- + 16V'n(Pj) < l7Kip^j ■.= Vn{pj). 

Then, 



I3„ nt= max ' V max 



<KiK2 



Now, for J € J2 we take 



Sj = 34Kii(; < 2nVn{pj 



so that the contribution of J2 to T^^q^t in (2.4') is just 2>AKiyj2w /n and the 
contribution to the probability bound, i^y'e(34-?^i/^-i)«'^ por j G Ji, we take 
Sj,k = s < 2K2W if w < nr"^ and Sj^k = otherwise. Then the contribution

of {(j, k):j e Jl] to Tn,q,t IS 

e^sliw > nr"^) Pis 



nr\og{e'^s/(K2nr'^)) V n ' 

on account of the fact that $x/\log x$ is increasing for $x > e$, whereas the
contribution to the probability bound is dominated by



k( maxiV. (Card Ji)e~'/(2^) < 2K( maxN. 1 + log„ ^ e" 
Collecting bounds, the lemma follows. □ 



■s/K 



In the previous proof, we could take s = e^K2W which dominates e'^nVnj^k 
for j G Jl, and obtain 

Vt\ sup > — 2 — iT-— + qKi{U + K2)\ — 

r <(Tp/<(5 

(4.15') <KweyiY>{-UKiw) 



+ 2K{ max Nj 



1 + logo — ^ I li^r"^ < ^) exp{-e'^K2w/K) , 



but in situations when Nj is small (e.g., a constant, as in the case of the 
uniform empirical c.d.f. in R) we should take s of a smaller order. 

Note that replacing $w$ by $cw$, $0 < c < \infty$, in the hypothesis of the previous
lemma yields the same conclusion up to constants.

Theorem 4.6 follows at once from (4.15'): 

Proof of Theorem 4.6. First, for any slice $\mathcal{F}_j$, we construct a partition
$\{\mathcal{F}_{j,k} : 1 \le k \le N_j\}$, as needed in Lemma 4.7. To this end, consider a
minimal covering of $\mathcal{F}_j$ with $L_2(P)$-balls of radius $\rho_j/(Kq)$ and define $\mathcal{F}_{j,k}$
as the intersection of $\mathcal{F}_j$ with the $k$th ball in the covering (if it is empty,
discard it). By the bound on the covering numbers of subclasses of a VC-subgraph
class, the number $N_j$ will be upperbounded by $(Kq)^v g_q^v(\rho_j)$, which
is in turn upperbounded by $ce^{\tau w(\rho_j)}$ for some $c, \tau > 0$. By the local bracketing
condition, for any $k = 1, \dots, N_j$ there exists an $L_2(P)$-bracket $[f_{j,k,-}, f_{j,k,+}]$
of size $K\rho_j$ covering the class $\mathcal{F}_{j,k}$. If we set $F_{j,k} := f_{j,k,+}$, $F_{j,k}$ becomes an
envelope of $\mathcal{F}_{j,k}$ and for arbitrary $f \in \mathcal{F}_{j,k}$

$$\|F_{j,k}\|_2 \le \|f_{j,k,+} - f_{j,k,-}\|_2 + \|f_{j,k,-}\|_2 \le K\rho_j + \|f\|_2 \le K\rho_j + C\sigma_P(f) \le (K + C)\rho_j =: K_2\rho_j.$$

Now we are in a position to apply directly inequality (4.15') [together with
Lemma 2.3 for part (c)]: increase if necessary the constant $K_2$ so that
$K_2/K - \tau - \tau'$ is positive for (a) and (b), and is larger than 1 for (c);
for (c) we should also increase $K_1$ so that $34K_1/K > 1$. The a.s. limit is a
constant by Borel-Cantelli. □

Remark 4.8. 1. Suppose that for all $t$ the "slice" $\mathcal{F}(tq^{-1}, t]$ is full for
$P$ and $H(u) = v\log Au$ with $\sigma = t$ (recall Definition 3.3 and Theorem 3.4).
Then, in the case (a) of Theorem 4.6 and under additional assumption

^ 00, 



loglogq(l/r„) 
we have for some C > 

pJc-><,/^ sup IMzZZI<c| 



as n ^ 00. 



rn<<^pf<S 



For example, it is clear that the (VC) class $\mathcal{C}$ of all closed (or open) intervals
in [0, 1] is full for the uniform distribution and so are any of the slices $\{C \in
\mathcal{C} : tq^{-1} < \sqrt{P(C)} \le t\}$. Then, $g_q(t) \asymp 1/t$. Take $r_n = \sqrt{(\log n)/n}$, which yields
$w_n = \bar{w}_n \asymp \log n$, so that Theorem 4.6(b) and (c) give

$$\limsup_{n \to \infty} \sqrt{\frac{n}{\log n}} \sup_{C \in \mathcal{C}:\ \log n/n \le P(C) \le 1/2} \frac{|P_n(C) - P(C)|}{\sqrt{P(C)}} = L < \infty \quad \text{a.s.}$$

Then, the class $\mathcal{C}$ being full, the above limit implies that $L > 0$, a result
first obtained by Shorack and Wellner [41] ($L < \infty$), Yukich [49] ($L > 0$) and
Alexander ([2], Corollary 3.9, equation (3.12)), where he also obtains it in
several dimensions. See also [30].

2. Note that the conclusions of Theorem 4.6 are also true if we only
assume (instead of the local bracketing condition) that $\mathcal{F}$ is as in Lemma 4.7,
except that now, the bracketing condition of this lemma holds for all $\rho_j$ with
$N_j \le ce^{\tau w(\rho_j)}$ for some $c, \tau > 0$. In principle, the condition $\|F_{j,k}\|_2 \lesssim \rho_j$ can
be replaced by weaker assumptions on the local envelopes $F_{j,k}$, which would
give rise to different rates.

Alexander [2] does not have an equivalent of the local bracketing assumption
in his Theorem 3.1 for VC classes of sets. At this moment, we do not
know whether this assumption is needed because of our method (based on
combining Talagrand's concentration inequalities and expectation bounds of
Section 3), or if it is unavoidable in some form for function classes. However,
this assumption holds in all classical examples of classes of sets to which
Alexander's Theorem 3.1 applies. Suppose, for instance, that $S = [0,1]^d$ for
some $d \ge 1$ and $P$ has a density that is uniformly bounded and bounded
away from 0 on $S$. For $C \subset S$ closed and $\delta > 0$, let $C^{\delta}$ be the set of all points
in $S$ that are within a distance $\le \delta$ from $C$ and let $C_{\delta}$ be the set of all
points $x$ such that the closed ball of radius $\delta$ around $x$ is included in $C$.
Denote by $h$ the Hausdorff distance between closed subsets of $S$, that is,

$$h(C_1, C_2) := \inf\{\delta > 0 : C_1 \subset C_2^{\delta},\ C_2 \subset C_1^{\delta}\}.$$

Let $\mathcal{C}$ be a VC class of closed convex subsets of $S$ such that, for some $K > 0$
and for all $C_0 \in \mathcal{C}$ with $P(C_0) > 0$,

$$K^{-1} h(C, C_0) \le P(C \triangle C_0) \le K h(C, C_0)$$

as soon as $P(C \triangle C_0) \le P(C_0)/K$. The upper bound of this inequality always
holds for convex sets (see [17], pages 269-270), but the lower bound
is satisfied only for special classes of sets (balls, rectangles, etc). Denote
$\sigma_P(I_C) := \sqrt{P(C)}$. Then the class $\mathcal{F} := \{I_C : C \in \mathcal{C}\}$ satisfies the local bracketing
assumption. The proof easily follows from several simple properties of
convex sets described on pages 269-270 of [17]. Indeed, if $C_0 \in \mathcal{C}$ and $0 <
\delta \le \sqrt{P(C_0)/K}$, then $P(C \triangle C_0) \le \delta^2 \le P(C_0)/K$ implies that $h(C, C_0) \le
K P(C \triangle C_0) \le K\delta^2$. It follows that, for $\sigma = K\delta^2$, $(C_0)_{\sigma} \subset C \subset C_0^{\sigma}$. Hence,

$$\{I_C : \|I_C - I_{C_0}\|_2 = \sqrt{P(C \triangle C_0)} \le \delta\} \subset [I_{(C_0)_{\sigma}}, I_{C_0^{\sigma}}].$$

Since also with some constant $K'$

$$P(C_0^{\sigma} \setminus (C_0)_{\sigma}) \le K'\sigma,$$

the above inclusion provides a bracket of the size needed in the local bracketing
condition. Quite similarly, one can check the condition for VC-subgraph
classes of concave (i.e., with a convex subgraph) functions on [0,1] as well
as for some other examples of function classes.

As an illustration, we apply Theorem 4.6 to the uniform empirical c.d.f.
in $\mathbb{R}^d$ ([2], Corollary 3.5).

Example 4.9 (The finite-dimensional uniform empirical c.d.f.). Let $P$
be Lebesgue measure on $[0, 1]^d$, $d \ge 1$, denote by $x^i$ the coordinates of points
$x \in \mathbb{R}^d$, let $\mathcal{F} = \{I_{[0,x]} : 0 \le x^i \le 1, \prod_{i=1}^d x^i \le 1/2\}$ and take $\sigma_P(I_{[0,x]}) :=
(\prod_{i=1}^d x^i)^{1/2}$. Then $\mathcal{F}$ is VC of index $v = d + 1$ ([17], Corollary 4.5.11) so
that (4.12) holds with this $v$, and some $A$. It is also easy to see that

$$\|F_j\|_2^2 = P\{x^1 \cdots x^d \le \rho_j^2\} \sim 2^{d-1}\rho_j^2 (\log \rho_j^{-1})^{d-1}/(d-1)!,$$

so that $g(\rho_j) \asymp (\log \rho_j^{-1})^{(d-1)/2}$.

The local bracketing condition holds by the argument given before the
example for convex sets in general. So, we can apply Theorem 4.6 with
$w_n \asymp \log\log r_n^{-1}$ and $\bar{w}_n \asymp \log\log n$, assuming, for (c), that $\log\log r_n^{-1}$ is not
larger than a constant times $\log\log n$ for all large $n$. The conclusion is: suppose
$r_n \to 0$ and $\log\log r_n^{-1}/\log\log n \to c > 0$; then,



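$$\liminf_n \frac{n r_n^2}{\log\log n} > 0 \implies \limsup_n \sqrt{\frac{n}{\log\log n}} \sup_{r_n^2 < \prod_i x^i \le 1/2} \frac{|F_n(x) - F(x)|}{\sqrt{\prod_i x^i}} < \infty \quad \text{a.s.}$$

and, assuming $\log\log n/(n^{1/2} r_n \log(\log\log n/(n r_n^2))) \downarrow$,

$$\lim_n \frac{n r_n^2}{\log\log n} = 0 \implies \limsup_n \frac{n r_n \log(\log\log n/(n r_n^2))}{\log\log n} \sup_{r_n^2 < \prod_i x^i \le 1/2} \frac{|F_n(x) - F(x)|}{\sqrt{\prod_i x^i}} < \infty \quad \text{a.s.}$$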

In particular, this last limit allows us to recover the tightness part of a 
limit theorem of [19], as follows. For dimension d>2 and e > 0, take = 
— . Then, the last inequality gives 



n(log n)'*' 

^ |i^^(x)-F(x)| ^ 

limsup- . , , sup , — < oo a.s. 

„ (logn)(<^-l)/2 ,(„(iog„).-.)-i<nti-.<l/4 ^/uL^^ 

A simple computation shows that if are d independent random variables 
uniform on [0, 1] , then 

d 



( d 



which, by another simple computation, allows us to conclude from the pre- 
vious limit that the sequence 

n |F„(x)-F(x)| 
sup 



(logu).-«/^nt.i./. 
is stochastically bounded. 
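The asymptotics for $P\{x^1 \cdots x^d \le \rho_j^2\}$ used in Example 4.9 can be checked against the classical closed form for a product of $d$ independent uniform variables, $P\{U_1 \cdots U_d \le t\} = t \sum_{k=0}^{d-1} (\log(1/t))^k/k!$ for $0 < t \le 1$. The sketch below is only an illustration: the function names and the Monte Carlo settings are assumptions of this example, and the closed form is quoted as a standard fact rather than as the display of the paper.

```python
import math
import random


def product_cdf(t, d):
    """Exact P{U_1 * ... * U_d <= t} for independent uniforms, 0 < t <= 1."""
    x = math.log(1.0 / t)
    return t * sum(x ** k / math.factorial(k) for k in range(d))


def leading_term(t, d):
    """Dominant term t * (log(1/t))^(d-1) / (d-1)! as t -> 0."""
    x = math.log(1.0 / t)
    return t * x ** (d - 1) / math.factorial(d - 1)


def product_cdf_mc(t, d, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        prod = 1.0
        for _ in range(d):
            prod *= rng.random()
        if prod <= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for d in (1, 2, 3):
        t = 1e-4
        print(d, product_cdf(t, d), leading_term(t, d), product_cdf_mc(0.5, d))
```

With $t = \rho_j^2$ the leading term equals $2^{d-1}\rho_j^2(\log \rho_j^{-1})^{d-1}/(d-1)!$, which is the expression appearing in Example 4.9.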

Example 4.10 (Example 2.7, revisited). Theorem 4.6 essentially does
not distinguish between stochastic and a.s. boundedness for the empirical
c.d.f. However, Lemma 4.7 does when $d = 1$. For $d = 1$ we can take $s$ of the
order of $\log(w_n/(n r_n^2))$ since $N_j$ is constant (as we did in Example 2.7). If
$r_n \ge \log\log\log n/\sqrt{n \log\log n}$, then $w_n \asymp \log\log n$, and $s/(n r_n)$ is dominated
by $\sqrt{\log\log n/n}$, so that Lemma 4.7 shows that the sequence

$$\sqrt{\frac{n}{\log\log n}} \sup_{r_n^2 \le x \le 1/2} \frac{|F_n(x) - x|}{\sqrt{x}}$$

is stochastically bounded. Then, since $\Pr\{\min_{i \le n} X_i \le \varepsilon/n\} \le \varepsilon$, we get that,
for $d = 1$, the sequence

$$\sqrt{\frac{n}{\log\log n}} \sup_{0 < x \le 1/2} \frac{|F_n(x) - x|}{\sqrt{x}}$$

is stochastically bounded. By the limiting result of Eicker [18] (see also
[14, 24]) this rate is exact.
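For readers who want to see the normalization of Example 4.10 at work, here is a minimal simulation sketch for $d = 1$. It computes $\sup_{0 < x \le 1/2} |F_n(x) - x|/\sqrt{x}$ exactly from the order statistics (on each interval where $F_n$ is constant the ratio is monotone or first decreasing and then increasing, so only end points matter) and rescales by $\sqrt{n/\log\log n}$. The function names, sample sizes and seed are assumptions of this illustration; a few simulated values cannot establish stochastic boundedness.

```python
import math
import random


def weighted_sup(sample, upper=0.5):
    """sup over 0 < x <= upper of |F_n(x) - x| / sqrt(x), sample in (0, 1]."""
    u = sorted(sample)
    n = len(u)
    best = 0.0
    k = 0  # number of sample points that are <= upper
    for i, x in enumerate(u, start=1):
        if x > upper:
            break
        k = i
        at_jump = abs(i / n - x) / math.sqrt(x)           # F_n(x) = i/n
        left_limit = abs((i - 1) / n - x) / math.sqrt(x)  # F_n(x-) = (i-1)/n
        best = max(best, at_jump, left_limit)
    # Right end point of the range, where F_n(upper) = k/n.
    return max(best, abs(k / n - upper) / math.sqrt(upper))


def normalized_statistic(n, seed=0):
    """sqrt(n / log log n) times the weighted sup for a uniform sample (n >= 3)."""
    rng = random.Random(seed)
    sample = [1.0 - rng.random() for _ in range(n)]  # values in (0, 1]
    return math.sqrt(n / math.log(math.log(n))) * weighted_sup(sample)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(normalized_statistic(n, seed=n), 3))
```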
4.3. The cases $\phi(t) = t^{\alpha}$, $\alpha \ne 1, 2$. For $\alpha \in (1, 2)$ direct application of
Proposition 2.11 gives the following analogue of Theorems 4.1 and 4.2:

Theorem 4.11. Let $\alpha \in (1, 2)$, $0 < \delta \le 1$ and $r_n \downarrow 0$, and set

\Pnf-Pf\ 
■■= sup . 

r„<apf<S [(Tpj)" 



(a) If nr" oo, then the condition (3n,q ■= Pn,q,t'^ for some q > 1 
is necessary and sufficient for — > in probability (a.s., if we also have 
nr "/ log log n —> oo ). 

(b) // sup„/3.„^g < oo and nr'^Pn,q oo for all q £ (1, 1 + 5) for some 
6 > 0, then 



in probability (and the convergence is a.s. if nr"(3n,q/ ^oglogn ^ oo). 
(c) For any q> 1, the sequence 



AnrMinCirt'^ APn,qr') V 2) A )g„ 

\Pn,q ^ rn V Pn,<j / 

is stochastically bounded. 

For VC classes of functions, adapting the proofs of Lemma 4.7 and The- 
orem 4.6 to the case of a G (1, 2] only gives the obvious: for instance, that if 
in Theorem 4.6 we replace apf by (apf)°' in the displays, then multiplying 
by the corresponding expressions produces sequences that are stochas- 
tically bounded [or a.s. bounded in part (c)]. This observation applies, for 
instance, to give the tightness part of the remaining cases in [19], namely, 
that, just as in Example 4.9, for 1/2 <v< \ (the case = 1/2 is covered by 
that example), the sequence 

n^-" |F„(x)-F(x)| 
sup 



d 

1=1 '■ 



is stochastically bounded, where we assume d > 2. Extensions of Exam- 
ple 4.10 to powers of x different from 1/2 are equally easy to get in the case 
d = l (they are omitted) . 

For $\alpha \in (0, 1)$, we make the rates explicit only under condition (2.15') and
the result is a direct consequence of Proposition 2.12.

Theorem 4.12. Let $\alpha \in (0, 1)$, $0 < \delta \le 1$ and $r_n \downarrow 0$, set




and assume 



a/(2-a) > 



3K log logq {q^S/r) 



n,q 



n 



for all q & {1,t), for some r > 1. Then: 

(a) ifnr'^ — > oo, then the condition (3n,q — > for some q> 1 is necessary 
and sufficient for — *■ in pr. ; 

(b) if \/n(3n,q — > for q G (1,t), then Cn/E^n 1 in pr., and there are 
sequences qn\l for which (,n/f3n,q„ I in pr.; 

(c) if l3n,q < C(n^^/^ A for some C < oo and q > 1, then the se- 
quence y/n^n is stochastically bounded, and 

(d) if, for some < C < oo and q > 1, r^~°' < Cj3n,q < n~^/^, then the 



is stochastically bounded.

For VC classes of functions, one obtains analogues of Lemma 4.7 and 
Theorem 4.6 for a G (0, 1) as follows: under the hypotheses of Lemma 4.7, 
the bound in (4.15) holds for the probability 



And under the hypotheses of Theorem 4.6, except that we replace $1/2 \ge \delta >
r_n$ by $1/2 \ge \delta_n > r_n$, we have:

(a) if liminf^ n(r"5^~")^/i(;„ > (infinity not excluded), then the se- 
quence 



sequence 





r<crpf<S 





sup 

rn<typf<S 



Pnf - Pf\ 



is stochastically bounded; 
(b) if lim„n(r^(5i-")Vwn = and log ^ = 0(e^"'") for some r' > 0,
then 

nrnlogjiVn/nr^) \Pnf - Pf\ 

r„<CTp/<c5 

is stochastically bounded; 

(c) the corresponding statements for asymptotic a.s. boundedness under 
monotonicity conditions analogous to those in Theorem 4.6(c). 

5. Ratio limit theorems II: asymptotic continuity moduli and weighted
central limit theorems. These two types of limit theorems usually involve
functions of the form $\phi(t) = tL(1/t)$ where $L$ is nondecreasing and slowly
varying at infinity.

5.1. Local and global moduli. Local asymptotic moduli in probability for
general classes of functions were already treated in [22], Theorems 4, 5 and
9. Here we will only derive an a.s. general result which is the companion
to Theorem 4 in the just mentioned reference. As usual, $\mathcal{F}$ is a measurable
class of functions taking values in [0, 1].

Following Alexander [2], a local asymptotic modulus of the empirical process
over $\mathcal{F}$ at 0 is an increasing function $\omega$ for which there exist $r_n \le \delta_n \le 1$
both nonincreasing, with $\sqrt{n}\,\delta_n$ nondecreasing such that

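$$\limsup_n \sup_{f \in \mathcal{F},\ r_n < \sigma_P f \le \delta_n} \frac{|\nu_n(f)|}{\omega(\sigma_P f)} < \infty \quad \text{a.s.}, \tag{5.1}$$

where

$$\nu_n(f) := \sqrt{n}(P_n f - Pf)$$

is the empirical process indexed by $\mathcal{F}$.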

Although the results below are under our general assumption that the
functions in the class $\mathcal{F}$ take values in [0, 1], in the case when $\sigma_P(f) :=
\sqrt{\mathrm{Var}_P(f)}$ and hence $\sigma_P(f + c) = \sigma_P(f)$, $\sigma_P(cf) = |c|\sigma_P(f)$ for any constant
$c$, a simple rescaling allows one to deal also with arbitrary uniformly bounded
classes of functions. This is of importance in the case of global moduli.
A global asymptotic modulus of the empirical process over $\mathcal{F}$ is any local
modulus for $\mathcal{F}' = \{f - g : f, g \in \mathcal{F}\}$ at 0, with $\delta_n \gg \sqrt{n^{-1}\log n}$.

Theorem 5.1. Let $q > 1$, $r_n \le \delta_n \le 1$ both nonincreasing, with $\sqrt{n}\,\delta_n$
nondecreasing, and let $\omega$ be a bounded nondecreasing function on [0, 1] such
that

$$\omega(t) \ge \sqrt{n}\,\psi_{n,q}(t), \qquad t \in [r_n, \delta_n],$$

for all $n$, and satisfying that $\omega(u)/u \downarrow$,



5n Jlog log n V log log^ {6nq^/rn) 
sup — ^- jj- < CX), 

log log n V log log {5nq^/rn) 

sup r < CX) 

and that these two sequences decrease when divided by n. Then, the limit 
(5.1) holds. 

Proof. We apply Theorem 2.1 and Lemma 2.3. Let $K$, which we can
assume to be larger than 1, be as in (2.4a). We take [see (2.3)]

$$V_{n,q}(\rho_j) = L(\rho_j^2 + 16\omega(\rho_j)/\sqrt{n}) \ge L(\rho_j^2 + 16\psi_{n,q}(\rho_j)),$$

for $j = 1, \dots, l_n$, where $l_n$ is the smallest integer $j$ such that $\rho_j = r_n q^j \ge \delta_n$,
and where $L$ is the largest of $K$ and the second supremum above. Then, if
we take $s_j = 2K(\log\log n + \log\log_q(\delta_n q^2/\rho_j))$, we have $s_j \le 2nV_{n,q}(\rho_j)$, and
inequality (2.4a) directly gives


Vrlq sup — > l + 2maxW^-2/^-^ ^ < 

Now the theorem follows from Lemma 2.3 and the hypotheses on $\omega$. □

Let $F_{q,u} = \sup\{|f| : f \in \mathcal{F}, u/q < \sigma_P f \le u\}$, $1 < q \le 2$, $0 < u \le 1$, be the
local envelopes for $\mathcal{F}$, and define $g_q(r)$ as any nonincreasing function satisfying
$q\|F_{q,r}\|_{L_2(P)}/r \le g_q(r) \le q/r$. By proceeding as in Theorem 9 in [22],
Theorem 5.1 gives that, for any bounded VC class of functions, the function
$\omega_1(t) = t\sqrt{\log\log(1/t) + \log g_q(t)}$ is a local asymptotic modulus at 0 and
that the function $\omega_0(t) = t\sqrt{\log(1/t)}$ is a global modulus, thus generalizing
Theorem 4.1 of [2] to classes of functions (and demonstrating the same difference
between local and global continuity moduli as in the classical cases
of Brownian motion, Brownian bridge and univariate empirical process). For
the global modulus, one takes $g_q(u) = q/u$.

5.2. Central limit theorems. We consider here weighted CLTs for empirical
processes in the spirit of Alexander [3]. Let $\psi$ be a strictly increasing
continuous function such that

$$\psi(0) = 0 \quad \text{and} \quad \lim_{t \to 0} \frac{\psi(t)}{t} = \infty. \tag{5.2}$$

We call such a function a weight. We will find conditions on $\psi$ and a decreasing
sequence $r_n$ so that, for a $P$-Donsker class $\mathcal{F}$ of uniformly bounded
functions we have



in $\ell_\infty(\mathcal{F} \setminus \mathcal{F}_0)$ and the limiting process $G_P/\psi \circ \sigma_P$ is sample continuous
on $\mathcal{F} \setminus \mathcal{F}_0$ for the pseudo distance $d_P(f, g) = \sigma_P(f - g)$, where $\mathcal{F}_0 := \{f \in
\mathcal{F} : \sigma_P f = 0\}$. (For definitions of $P$-Donsker or CLT($P$) classes, pre-Gaussian
classes, and others associated to uniform central limit theorems, see, e.g.,
[17] or [47].)

We need to comment on condition (5.2). For classes of sets, this condition
is necessary for $G_P/\psi \circ \sigma_P$ to be a.s. in $\ell_\infty$ (Lemma 5.1 in [4]) but
this is not so for classes of functions: just consider $\mathcal{F} = \{af : 0 < a \le 1\}$
for some bounded function $f$. Then, $G_P(af)/\sigma_P(af)$ does not depend on
$a$ and the sample paths are just constants. However, if the class $\mathcal{F}$ is sufficiently
rich, then (5.2) is also necessary; for instance, assume that $\mathcal{F}$ is
convex and symmetric (i.e., $f_i \in \mathcal{F}$ and $\sum_{\text{finite}} |\lambda_i| \le 1$ implies $\sum \lambda_i f_i \in \mathcal{F}$)
and that the subspace of $L_2(\Pr)$ generated by the process $G_P(f)$, $f \in \mathcal{F}$, is
infinite dimensional (if it were finite dimensional, we would be in the case
of the finite-dimensional central limit theorem). Then, by Gram-Schmidt
orthogonalization, there exists an infinite sequence of functions $f_i$ in $\mathcal{F}$ such
that $\sigma_P(f_i) \ne 0$ and $EG_P(f_i)G_P(f_j) = 0$ if $i \ne j$. But then $G_P(f_i)/\sigma_P(f_i)$
are i.i.d. $N(0, 1)$ and their sup is infinite with probability 1. We are thus
justified in assuming condition (5.2) for our weights.
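The last step of this argument, that the supremum of infinitely many i.i.d. $N(0, 1)$ variables is infinite with probability 1, can be visualized with a short simulation. The sketch below tracks the running maximum of a simulated standard normal sequence and prints it next to $\sqrt{2\log m}$, the classical growth rate of the maximum of $m$ independent standard normals. The function name, seed and checkpoints are assumptions of this illustration.

```python
import math
import random


def running_max_of_normals(m, seed=0):
    """Running maximum of m i.i.d. standard normal draws."""
    rng = random.Random(seed)
    best = -math.inf
    path = []
    for _ in range(m):
        best = max(best, rng.gauss(0.0, 1.0))
        path.append(best)
    return path


if __name__ == "__main__":
    path = running_max_of_normals(100_000)
    for m in (10, 100, 1_000, 10_000, 100_000):
        # The running maximum keeps growing, roughly like sqrt(2 log m).
        print(m, round(path[m - 1], 3), round(math.sqrt(2 * math.log(m)), 3))
```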

Another useful remark is the following: 

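Lemma 5.2. Assuming (5.2) and $\mathcal{F}$ $P$-pre-Gaussian, if $G_P/\psi \circ \sigma_P$ is
$d_P$ sample continuous on $\mathcal{F}$ (meaning that it has a version with bounded
and $d_P$-uniformly continuous sample paths), then

$$\lim_{\delta \to 0} \sup_{f \in \mathcal{F},\ \sigma_P f \le \delta} \frac{|G_P(f)|}{\psi(\sigma_P f)} = 0 \quad \text{a.s.}$$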

Proof. If $\sigma_P(f)$, $f \in \mathcal{F} \setminus \mathcal{F}_0$, is bounded away from zero, then there is
nothing to prove. Otherwise, let $f_n \in \mathcal{F}$ be such that $\sigma_P f_n \to 0$. Then, the sequence
$G_P(f_n)/\psi(\sigma_P f_n)$ is a.s. Cauchy by hypothesis, and since $E(G_P(f_n)/\psi(\sigma_P f_n))^2 \to 0$
by (5.2), it also converges to zero in probability. Hence, this sequence converges
to zero a.s. □

The following proposition, which is analogous to Theorem 4.2 in [4], will
allow use of the inequality in Theorem 2.1. From now on we will assume
without loss of generality that $\mathcal{F}_0$ is empty and that the functions in $\mathcal{F}$ take
values in [0, 1].

Lemma 5.3. Let $\mathcal{F}$ be a measurable class of functions, let $\psi$ be a weight
function as defined above and let $r_n \to 0$, $r_n > 0$. Then,



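in $\ell_\infty(\mathcal{F})$ and the limiting process $G_P/\psi \circ \sigma_P$ is $d_P$ sample continuous on
$\mathcal{F}$ if and only if both

$$\mathcal{F}_{>r} := \{f \in \mathcal{F} : \sigma_P(f) > r\} \ \text{is } P\text{-Donsker}$$

and

$$\lim_{\delta \to 0} \limsup_n \Pr\Bigl\{\sup_{f \in \mathcal{F},\ r_n < \sigma_P f \le \delta} \frac{|\nu_n(f)|}{\psi(\sigma_P f)} \ge \varepsilon\Bigr\} = 0. \tag{5.3}$$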

Proof. If the weighted processes converge in law and the limit is $d_P$
sample continuous, then, by the continuous mapping theorem, $\mathcal{F}_{>r}$ is $P$-Donsker.
Also, by the portmanteau lemma,

$$\limsup_n \Pr\Bigl\{\sup_{f \in \mathcal{F},\ r_n < \sigma_P f \le \delta} \frac{|\nu_n(f)|}{\psi(\sigma_P f)} \ge \varepsilon\Bigr\} \le \Pr\Bigl\{\sup_{f \in \mathcal{F},\ \sigma_P f \le \delta} \frac{|G_P(f)|}{\psi(\sigma_P f)} \ge \varepsilon\Bigr\},$$

which, by Lemma 5.2, tends to zero as $\delta \to 0$. The direct part follows as in
[3]. □

Theorem 5.4. Let r„ ^ 0, < r„ < 1/2, and let he a weight function 
such that supo<a;<i/2 = C < 00. Assume 



r^loglogg,^ 1/r 
(5.4) limlimsup sup — —r = 

S^O n re{r„,5] '^{r) 



and 



loglogg„ 1/rr, 



(5.5) , 7 ' >0 as 00, 

VAi'n)\/n 

where 2 > qn\ 1 or Qn = c. Then, the conditions J->r G CLT{P) for all 
r > and 

(5.0) Imilimsup sup —-^ = U 

are necessary and sufficient for the process :|^f~jj) / ^ -^j to be dp sample 
continuous and for 

ny^Pn-Pm ..w .C_Gp{f)_ . 

(5-7 -7 Tjrz /o-p/)>r„^— — m£oo{^). 
Proof. By Lemma 5.3, the proof is basically the same as that of The-
orem 5.1. Define 

e{n,6)J^^^^y sup 
which, by (5.5) and (5.6), satisfies lim5_»o limsup„ e(n, 5) = 0, and then, 

>K[p] + 16^|;n,gApJ)], 

where K >1 comes from (2.4a), and where, as usual, pj = rnqli so that it 
depends on n even if we do not show it. Note that Vn.q„{Pj) is admissible 
in Theorem 2.1 by (2.3). Now set sj = 2K[t{n,6) + loglogg^{dq'^ / pj)], with 
t{n,5) = minpoglogg^ r'^ ,mir„<r<Sq,, '4^(r)/r], which satisfies 

(5.8) lim liminf (^) = oo, 

because r„ — > and -0 is a weight function. Note also that, by the hypotheses, 
Sn,j < 2nF„^g„(/0j) for all 1 < j < iniS), where in{S) is the smallest integer j 
such that rnQ^ > 6. Therefore, Theorem 2.1 or Corollary 2.2 gives 



Fri C sup — > sup 



+ 2 max W '^'^"f ^f^'^ 1 < i^e-2*(".^) . 

3 V r{Pj) J 



Now, condition (5.6) implies that limg^o lim sup„ sup^ ^^"('p")^^'' ^ = and 



moreover, smce 



1 SjVn,qr,{pj) ^ Pj ^ 32e(n,(^)loglogg^r„ 



2K^ ij^p,) - i,{pj) ^iPirn) 
p] loglogg J6q'^/pj) 

^Hp,) 

equations (5.4) and (5.5) imply 



(5.9) limlimsupmaxJ^^%^ = 0. 

Therefore, (5.3) holds, and by Lemma 5.3, so does (5.7). Conversely, if (5.6) 
does not hold, while we still have (5.8) and (5.9), the term limsup„ sup^ ^^"p")^^^ ^ 
stays bounded away from zero for a sequence 6k — > 0, and this implies, by the 
second inequality in Corollary 2.2, that (5.3) does not hold, and therefore,
by Lemma 5.3, neither does (5.7). □ 

In order to apply the above theorem, one needs to have reasonable esti- 
mates of 'ipn,qii"), and it is here where the results in Section 3 may become 
useful. 

In the case of the classical VC-subgraph classes (the uniform empirical
distribution function, indicators of intervals for the uniform law on the unit
cube, or half-spaces for the normal), the above theorem does not give best
possible results, just as in the case of $\phi(x) = x$ in Section 4.2 [see (4.6)-(4.8)].
As in that section, we will prove a theorem that handles these cases, but we
will only apply it to the multidimensional empirical c.d.f.

For a VC-subgraph class $\mathcal{F}$, or more generally for a VC type class [i.e., one
such that for any $\mathcal{G} \subseteq \mathcal{F}$ and any probability measure $Q$, $N(\mathcal{G}, L_2(Q), \tau) \le
(A\|G\|_{L_2(Q)}/\tau)^v$ for some $A \ge e$, $v \ge 1$ and all $0 < \tau \le 2\|G\|_{L_2(Q)}$], and given
$0 < r < \delta \le 1$ and $1 < q \le 2$, let $g_q(t)$ and $w$ be as defined in (4.13) and (4.14).

Theorem 5.5. Let $\mathcal{F}$ be a VC-subgraph class satisfying the local bracketing
condition and let $r_n \to 0$, $q \in (1, 2]$. Let

$$\phi(t) = tL(1/t), \qquad 0 < t \le 1, \tag{5.10}$$

with $L(u) \nearrow \infty$ as $u \nearrow \infty$ and $u^{\tau}L(1/u)$ nondecreasing for some $0 < \tau < 1$.
Assume $\mathcal{F}_{>r} \in \mathrm{CLT}(P)$ for all $r > 0$. Then, the conditions



(5.12) lim ' . . = and lim — — — — - — — ^ = 0, 

u^o L{1/u) n^oo Vnr„L(l/r„)log(t(;„/nr^) 

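where $w_n = w(r_n)$, imply that the Gaussian process $G_P(f)/\phi(\sigma_P f)$, $f \in \mathcal{F}$,
is sample continuous and

$$\frac{\sqrt{n}(P_n - P)(f)}{\phi(\sigma_P f)}\, I(\sigma_P f > r_n) \to_{\mathcal{L}} \frac{G_P(f)}{\phi(\sigma_P f)} \quad \text{in } \ell_\infty(\mathcal{F}).$$

For the proof, we begin with the analogue of Lemma 4.7.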

Lemma 5.6. Let $\mathcal{F}$ be a VC type class of functions satisfying the same
hypotheses as in Lemma 4.7 (the bracketing properties). Let $L$, $\phi$, $\tau$ be as in
Theorem 5.5 and set $\gamma = 2/(1 - \tau)$. Let $0 < r < \delta \le 1$ and $q \in (1, 2]$. Then,
there is $C = C(K_1, K_2, q)$ such that, for all $n \in \mathbb{N}$,



^4 #71^^^ 

{r<apf<s (p{crpf) 



u+ yjlog gq{u) 

/w/n<u<5q 



max , , , 

L(l/n) 
+



(5.11) 



nrL{l/r) log{e'^ w / (nr^)) 



1 



W 



K ( max Nj][l + - log^ — ^ ) I{nr^ < w)e 



-e''K^w/K 



with notation as in Lemma 4.7. 



Proof. We will apply Theorem 2.1'. As in Lemma 4.7, let $J_1 = \{j : n\rho_j^2 \le
w\}$ and $J_2 = J \setminus J_1$, where $J = \{1, \dots, l\}$. Then, on $J_1$,



n 



and on J2, by (4.12) 



PjJ\oggg{pj) 



n 



So, 



max 



Vmax- 



ij,k):j£JiPjL{l/pj) ieJ2 PjL{l/pj) 



K1K2 

< — ^ — . . . V max 



^oggqipj] 



I log gq{u) 
< K1K2 max , , . 

^Jw/n<u<5q 

since log^g > 1. For j G J2, we take Sj = 2nVn{pj) = SAKinpj > SAKiw, so 
that the contribution of J2 to Tnj^rf) is 



2 max , 



max 



whereas the contribution to the probability bound is 

i<:(CardJ2)e-34-^i"'/^ < Klog{6q/r)e-^^^''^/^^ < Kwe-^^^^""'^ . 
To keep things simple, on Ji we take 

Sj = e^s := e^K2W > e'^K2np'j = e^Vnj^k- 



Then, the contribution of Ji to r^j-^^ is 



max 



< 



e'KlwI{nr'^ < 



w 



j&Ji y/nrq^L{l/rq^)\og{e^w/{nr'^q^^)) y/nrL{l/r)\og{e"iw/{nr'^))' 
where the inequality holds because $L(1/x)$ is increasing and $u^{(1-\tau)/2}/\log u$
is increasing for $u > e^{2/(1-\tau)} = e^{\gamma}$. Finally, the contribution of $J_1$ to the
probability bound is

K (^maxNj^{CaTdJi)e-^^'^>/^ < ^(^niaxiVj^ (^1 + ^ log e-^^^a'"/^. 

Now the lemma follows from collecting the above bounds and plugging them 
into inequality (2.4'), Theorem 2.1'. □ 

Remark 5.7. The bound (5.11) can be slightly refined by taking Sj = 
s < e^K2W for j € Ji such that s > i^l^^pj and Sj = SAKipjn for the 
remaining j's in Ji. Then, the contribution of Ji to r becomes 

e'sI{Klnr^<s) ^ 

+ ^A2- 



JffmaxiV, l + -log-^]e-^^/^. 



/ErL{l/r){lVlog{e-ys/{Klnr^))) L{^/^)' 
and the probability contribution is 

1 w 

V V" 2 

The resulting inequality is analogous to that of Lemma 4.7, whereas (5.11) 
is more similar to (4.15'). 

Theorem 5.5 is an immediate consequence of Lemmas 5.3 and 5.6. 
The following example shows how this theorem recovers the (sufficiency 
part of the) results in Example 2.9 of [3]. 

Example 5.8 (The finite-dimensional uniform empirical c.d.f.). In the
case of $\mathcal{F} = \{I_{[0,x]} : 0 \le x^i \le 1, \prod_{i=1}^d x^i \le 1/2\}$, $P$ being the uniform measure
on $[0, 1]^d$, as shown in Example 4.9, $g_q(\rho_j) \asymp (\log \rho_j^{-1})^{(d-1)/2}$ and the class
satisfies the local bracketing condition. As long as $\log\log r_n^{-1}$ is of the same
order as $\log\log n$, we have $w_n \asymp \log\log n$. Then, the first condition (5.12) requires
$L(u) \gg (\log\log u)^{1/2}$ as $u \to \infty$. To illustrate, take $L(u) = (\log\log u)^{\alpha}$
for $\alpha > 1/2$. Then, the second condition (5.12) readily implies the CLT for
any r„ satisfying r„ 3> ^^og^(^\ogn ' '^^ic'^ best possible for d > 1 [3]. How- 
ever, as in the case of Theorem 4.6, this is not sharp for d=l: in this case, 
since in Lemmas 4.7 and 5.6 $N_j$ = constant, we can take a smaller $s$ and
still have the probability bound that results from Remark 5.7 tend to zero,
for instance, $s = \log(w_n/(n r_n^2))$. It is easy to see that if $L(u)/\log\log u \to \infty$ as
$u \to \infty$, then Remark 5.7 implies the CLT for $r_n$ of a strictly smaller order
than $1/\sqrt{n}$ ($r_n = \log\log\log n/\sqrt{n \log\log n}$), which by the same argument as
in Example 4.10, implies that we can take $r_n = 0$, namely, one obtains the
well-known Cibisov-O'Reilly CLT for the weighted uniform empirical c.d.f.
in $\mathbb{R}$ [3, 12, 36].

So, Theorem 5.5, perhaps complemented by a modification along the lines
of Remark 5.7, does give results comparable to those in [3] for the classical
classes of sets and, moreover, it applies as well to classes of functions.

6. Applications I: ratios of margin distributions. The goal of this section 
is to suggest a much easier approach to the proofs of some of the results of 
Koltchinskii [25] on bounding margin distributions. The motivation and the 
terminology come from learning theory: functions / below represent what 
is known as "classification margins." "Large margin algorithms" tend to 
output functions (classifiers) / whose empirical distribution is shifted in 
the positive direction. The question is whether the true distribution is also 
shifted in the same direction. Since we are interested in the values of the 
margin for which these distribution functions are small, it is natural to study 
their ratios. See [25] for a detailed discussion. 

Let

$$F_f(\delta) := P\{f \le \delta\}, \qquad F_{n,f}(\delta) := P_n\{f \le \delta\}.$$

Suppose that $\mathcal{F}$ is a class of functions such that

Ve>0 logiV(^;L2(Pn);e)< 

with some constants $D > 0$ and $\alpha \in (0, 2)$.

For two distribution functions $F$, $G$ and interval $(a, b)$, define

$$M_{a,b}(F; G) := \log \inf\{c \ge 1 : \forall t \in (a, b) : F(t) \le cG(ct) \text{ and } G(t) \le cF(ct)\}.$$

If $F$, $G$ are distribution functions on the positive real line [i.e., $F(0) = G(0) = 0$],
then $M(F; G) := M_{0,+\infty}(F; G)$ is a metric (a multiplicative version of Lévy
distance). We want to study the closeness of $F_{n,f}$ to $F_f$ in distances of this
type uniformly in $f \in \mathcal{F}$. Unfortunately, the metric $M$ itself cannot be used
even in the case of a single function $f$ (the range of $t$'s in the definition has
to be restricted). However, define for $\Lambda > 0$

6n{f; A) := inf{6 > n"^ : 52"/(2+")Fj(5) > An-2/(2+")}. 
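To make the definition of $M_{a,b}(F; G)$ concrete, the following sketch approximates it on a finite grid of points $t \in (a, b)$ with $a \ge 0$. For nondecreasing, nonnegative $F$ and $G$ the set of admissible constants $c$ is an up-set, so a bisection on $c$ suffices. The function name, the grid, the cap `c_max` and the tolerance are all assumptions of this illustration; restricting to a grid yields only a lower approximation of the infimum over the whole interval.

```python
import math


def multiplicative_levy(F, G, a, b, grid_size=400, c_max=1e6, tol=1e-9):
    """Grid approximation of M_{a,b}(F; G) for distribution functions F, G."""
    ts = [a + (b - a) * (i + 0.5) / grid_size for i in range(grid_size)]

    def feasible(c):
        # Both one-sided conditions of the definition, checked on the grid.
        return all(F(t) <= c * G(c * t) and G(t) <= c * F(c * t) for t in ts)

    if feasible(1.0):
        return 0.0
    if not feasible(c_max):
        return math.inf
    lo, hi = 1.0, c_max
    while hi - lo > tol * hi:
        mid = math.sqrt(lo * hi)  # bisection on the logarithmic scale
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return math.log(hi)


if __name__ == "__main__":
    def uniform_cdf(t):
        return min(max(t, 0.0), 1.0)          # uniform on [0, 1]

    def half_scale_cdf(t):
        return min(max(2.0 * t, 0.0), 1.0)    # uniform on [0, 1/2]

    print(multiplicative_levy(uniform_cdf, uniform_cdf, 0.01, 2.0))
    print(multiplicative_levy(uniform_cdf, half_scale_cdf, 0.01, 2.0))
```

In the second call the binding constraint is $2t \le c^2 t$ for small $t$, so the value should be close to $\log\sqrt{2}$.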

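Theorem 6.1. If $\Lambda_n \to \infty$ as $n \to \infty$ and

$$\sup_{f \in \mathcal{F}} P\{f \ge t\} \to 0 \quad \text{as } t \to \infty,$$

then

$$\sup_{f \in \mathcal{F}} M_{\delta_n(f; \Lambda_n), +\infty}(F_{n,f}, F_f) \to 0 \quad \text{as } n \to \infty \text{ a.s.}$$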
Proof. The proof is based on a couple of inequalities that follow from
Proposition 2.8 of Section 2. Namely, it will be shown that for all $q > 1$
with some constant $c > 0$ depending only on $D$ and $q$ and with an absolute
constant $K > 0$ we have

Pr{3 / eJ^-.dnif, 2?2"/(2+«)^-2) < ^ ^^^^ > _ + a)6)} 

- q^-lt 

and 

Pr{3 / G J^:Sn{f; I)2a/{2+a)^-2) < ^ ^nd > (1 + ca)Ff{{l + 



2 



for all t > 0, (J G (0, 1] and 

To this end, given $\delta > 0$, define a function $\varphi$ that is equal to 1 on $(-\infty, \delta]$,
equal to 0 on $[(1 + \sigma)\delta, +\infty)$ and is linear in between. Clearly, $\varphi$ is Lipschitz
with constant $L = (\sigma\delta)^{-1}$. Denote

1 ^"/(2+a) 

where A = DL. 

Then Ff{5) >rl iff 5 > L>2°/(2+«)fj-2) ^^^^ hence, for 5 > 
^2a/(2+a)^-2) we also have 

P{voff>Ff{5)>rl. 

Define 



A„ := sup 



Then for 6 > <5„(/; L>2"/(2+")^-2) 

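$$F_f(\delta) \le P(\varphi \circ f) \le (1 - \Delta_n)^{-1} P_n(\varphi \circ f) \le (1 - \Delta_n)^{-1} F_{n,f}((1 + \sigma)\delta)$$

and

$$F_{n,f}(\delta) \le P_n(\varphi \circ f) \le (1 + \Delta_n) P(\varphi \circ f) \le (1 + \Delta_n) F_f((1 + \sigma)\delta).$$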
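Both chains of inequalities rest on the pointwise sandwich $I(x \le \delta) \le \varphi(x) \le I(x \le (1 + \sigma)\delta)$ for the piecewise linear function $\varphi$ introduced in the proof; integrating it against $P$ or $P_n$ gives the outer comparisons with $F_f$ and $F_{n,f}$. The sketch below, with hypothetical function names, checks the sandwich and the Lipschitz constant $1/(\sigma\delta)$ on a grid. It is a numerical sanity check under these assumptions, not a proof.

```python
def ramp(x, delta, sigma):
    """1 on (-inf, delta], 0 on [(1 + sigma) * delta, inf), linear in between."""
    upper = (1.0 + sigma) * delta
    if x <= delta:
        return 1.0
    if x >= upper:
        return 0.0
    return min(1.0, (upper - x) / (sigma * delta))


def sandwich_holds(delta, sigma, grid_size=2001):
    """Check I(x <= delta) <= ramp(x) <= I(x <= (1 + sigma) delta) on a grid."""
    upper = (1.0 + sigma) * delta
    xs = [2.0 * upper * i / (grid_size - 1) for i in range(grid_size)]
    lipschitz = 1.0 / (sigma * delta)
    for x, y in zip(xs, xs[1:]):
        low = 1.0 if x <= delta else 0.0
        high = 1.0 if x <= upper else 0.0
        value = ramp(x, delta, sigma)
        if not (low <= value <= high):
            return False
        # Increment bounded by the Lipschitz constant times the step.
        if abs(value - ramp(y, delta, sigma)) > lipschitz * (y - x) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(ramp(1.25, 1.0, 0.5))
    print(sandwich_holds(1.0, 0.5))
```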

To prove the inequalities it remains to obtain a bound for $\Pr\{\Delta_n \ge c\sigma\}$,
which is done using Proposition 2.8. First note that since $\varphi$ is Lipschitz with
constant $L$, we have for the class $\varphi \circ \mathcal{F} := \{\varphi \circ f : f \in \mathcal{F}\}$

$$\forall \varepsilon > 0 \quad \log N(\varphi \circ \mathcal{F}; L_2(P_n); \varepsilon) \le \log N(\mathcal{F}; L_2(P_n); \varepsilon/L)$$



By Theorem 3.1, 



E sup \{Pn-P){ifof)\<C 
P((/3o/)2<r2 



40/2 AO 



n 



Under the assumption 



^Q/(2+a) 



the first term dominates, so we have 

En:=suY>^E sup \{Pn- P){^o f)\<C^r-^-^'^ = Ca, 
by the definition of r„. Using Proposition 2.8, we get 



Pr| A„ > qCa + 2qJ-^{l + IQCa) 

2gt 



V 



Now if 



nrl\og{t/{nrl{l + 16Ccj)) V2) 



< A' 



q^-lt 



d< 



t(2+a)/2a 

then, by a simple computation, < cr^, so we easily get with some constant 
c> and for a G (0, 1] 



Pr{Ar,> ca}<K^—-je 
q'^ — It 



-t/Kq^ 



The inequalities now follow. We will use them for 5j = q G [n to 
prove that on an event E with 

Pr(i=;) > A-log,Mn(t))-2^ie-*/^«' 

we have 

Vj V5 G <5,] < < (1 - ca)~^F.^j{{l + a),^^) 

<(l-ca)-iF„j((l + a)<75) 
and

Vj y5e{Sj+i,6j] F^jiS) < E^j{6j) < {1 + ca)Ff{{l + a)6j) 

< {l+ca)Ff{{l+a)q6), 

which implies that on this event

Cd 

sup M5„(/;dW(2+<:«)^-2)^^,^(j) {Fnj; Ff) <{aq + q-l)V 
Choosing t = tn = 2Kq^ log n and using the Borel-Cantelli lemma we get

CCJ 

limsupsupAf^ (.^2<./{2+a)^-2)^, < (frg + g-ljV- a.s., 

n— >oo /gJT i — C(T 

and since $\sigma > 0$ and $q > 1$ are arbitrary and, under the condition $\Lambda_n \to \infty$,
for large enough $n$,

*n(/;A„)><5n(/;I)'"/('+"V-2), 

we get 

supM5„(/.A„),A„(t„)(-P;,/;i^/) ^0 a.s. 

It now follows from the definitions that to prove 

supM5„(j.A„),+oo(-Pk,/;-F'/) ^0 a.s. 

it suffices to check that 

sup MB„,+oc{Fn,f; Ff) ^0 a.s. 

for any sequence Bn such that 



OO. 



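We have by conditions

$$\tau_n := \sup_{f\in\mathcal F} P\{f > B_n\} \to 0$$

and it also follows from (3.1) in [25] that a.s.

$$\eta_n := \sup_{f\in\mathcal F} P_n\{f > B_n\} \to 0.$$

For all $f\in\mathcal F$ and all $\delta > B_n$, we have

$$F_f(\delta) \ge 1-\tau_n \quad\text{and}\quad F_{n,f}(\delta) \ge 1-\eta_n.$$

Let $c > 1$. Then a.s. for all large enough $n$ (such that $\tau_n \le 1 - c^{-1}$ and
$\eta_n \le 1 - c^{-1}$), for all $f\in\mathcal F$ and all $\delta > B_n$,

$$F_f(\delta) \le 1 \le cF_{n,f}(\delta) \quad\text{and}\quad F_{n,f}(\delta) \le 1 \le cF_f(\delta),$$

implying

$$\sup_{f\in\mathcal F} M_{B_n,+\infty}(F_{n,f}; F_f) \le \log c \le c-1,$$

and the result follows. □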



7. Applications II: excess risk bounds in empirical risk minimization. In
this section, we discuss the problem of minimizing $Pf$ over the class $\mathcal F$ that
is interpreted in learning theory as a risk minimization problem (e.g., in
the regression or classification setting). Since the distribution $P$ is typically
unknown, this problem has to be replaced by empirical risk minimization

$$P_n f \longrightarrow \min,\qquad f\in\mathcal F.$$

For simplicity, assume that $\hat f_n$ is a precise solution of the above problem,
that is, it is an empirical risk minimizer (the results can be easily modified
if it is only an approximate solution). Given $f\in\mathcal F$, let

$$\mathcal E_P(f) := Pf - \inf_{g\in\mathcal F} Pg.$$

This quantity is often called the excess risk of $f$. It is of interest to obtain
bounds on the excess risk $\mathcal E_P(\hat f_n)$ of the empirical risk minimizer $\hat f_n$. It is
also of interest to have some control of the ratios $\mathcal E_{P_n}(f)/\mathcal E_P(f)$ uniformly in $\mathcal F$.
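A minimal numerical sketch of these definitions, assuming a finite sample space and a finite class so that $Pf$ and $P_nf$ are plain weighted averages. The distribution `P`, the class `F` and the sample `sample` are hypothetical values chosen only for illustration, and the function names are not from the paper.

```python
from fractions import Fraction as Fr

# Hypothetical setup: S = {0, 1, 2}, with P given by its point masses.
P = [Fr(1, 2), Fr(3, 10), Fr(1, 5)]

# A finite class F of functions S -> [0, 1], each stored as its value table.
F = {
    "f1": [Fr(0), Fr(1), Fr(1, 2)],
    "f2": [Fr(1, 2), Fr(0), Fr(1)],
    "f3": [Fr(1, 4), Fr(1, 4), Fr(1, 4)],
}

def empirical_measure(sample, size):
    """Return P_n as point masses: the fraction of sample points equal to each x."""
    n = len(sample)
    return [Fr(sum(1 for x in sample if x == k), n) for k in range(size)]

def integral(f, measure):
    """Compute Q f = sum_x f(x) Q{x} for a measure Q on a finite set."""
    return sum(fx * qx for fx, qx in zip(f, measure))

def excess_risks(F, measure):
    """Excess risk E_Q(f) = Q f - min_g Q g for every f in the finite class F."""
    risks = {name: integral(f, measure) for name, f in F.items()}
    best = min(risks.values())
    return {name: risk - best for name, risk in risks.items()}

sample = [0, 0, 1, 2, 0, 1, 0, 2, 2, 0]    # a fixed, made-up sample of size 10
Pn = empirical_measure(sample, len(P))

true_excess = excess_risks(F, P)            # E_P(f)
emp_excess = excess_risks(F, Pn)            # E_{P_n}(f)
erm = min(emp_excess, key=emp_excess.get)   # an empirical risk minimizer

for name in F:
    print(name, true_excess[name], emp_excess[name])
print("ERM:", erm, "excess risk of ERM:", true_excess[erm])
```

In this toy example the empirical risk minimizer happens to coincide with the true minimizer; the ratios of empirical to true excess risk for the other two functions differ from 1, which is the kind of deviation the bounds of this section control.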

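The bounds given below are modifications of recent results of Koltchinskii [26]. Let

$$\mathcal F(\delta) := \{f\in\mathcal F : \mathcal E_P(f)\le\delta\}$$

be the $\delta$-minimal set of $P$. For $\rho_P$ such that

$$\rho_P^2(f,g) \ge P(f-g)^2 - (P(f-g))^2,$$

define the diameter of the set $\mathcal F(\delta)$

$$D(\delta) := D_P(\delta) := \sup_{f,g\in\mathcal F(\delta)}\rho_P(f,g).$$

Also define

$$\phi_n(\delta) := E\sup_{f,g\in\mathcal F(\delta)}|(P_n-P)(f-g)|.$$

Let

$$\beta_n(r) := \sup_{\rho\ge r}\frac{\phi_n(\rho)}{\rho}$$

and

$$\Delta(r) := \sup_{\rho\ge r}\frac{D^2(\rho)}{\rho}.$$

Finally, for $s > 0$, denote

$$\gamma_n(r,s) := \beta_n(r) + 2\sqrt{\frac{s}{nr}\,(\Delta(r)+16\beta_n(r))} \vee \frac{2s}{nr\log((s/(nr(\Delta(r)+16\beta_n(r))))\vee 2)}.$$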
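As a rough illustration of how the quantity $\gamma_n(r,s)$ behaves, the following sketch evaluates it from given numbers. It assumes $\gamma_n(r,s)$ is $\beta_n(r)$ plus the maximum of a square-root term $2\sqrt{(s/(nr))(\Delta(r)+16\beta_n(r))}$ and a logarithmic term $2s/(nr\log((s/(nr(\Delta(r)+16\beta_n(r))))\vee 2))$; this reading of the display, the function name `gamma_n` and all input values are assumptions made for illustration.

```python
import math

def gamma_n(beta, delta, n, r, s):
    """Evaluate gamma_n(r, s) from beta = beta_n(r) and delta = Delta(r).

    Assumption: the maximum is taken between the square-root term and the
    logarithmic term, and beta is then added to that maximum.
    """
    v = delta + 16.0 * beta                       # Delta(r) + 16 beta_n(r)
    sqrt_term = 2.0 * math.sqrt(s / (n * r) * v)
    log_term = 2.0 * s / (n * r * math.log(max(s / (n * r * v), 2.0)))
    return beta + max(sqrt_term, log_term)

# Hypothetical inputs, for illustration only.
g = gamma_n(beta=0.01, delta=16.0, n=10_000, r=0.1, s=5.0)
q = 1.5
print(g, q * g < 1.0)   # the second value says whether q * gamma_n(r, s) < 1
```

With these made-up inputs the square-root term dominates, and increasing `s` (a smaller failure probability) increases the value, so a larger `r` is needed to keep $q\gamma_n(r,s)$ below 1.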

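Theorem 7.1. There exists a constant $K > 0$ such that for $q > 1$, $s > 0$
and $r > 0$ satisfying the condition

$$q\gamma_n(r;s) < 1,$$

the following inequality holds:

(7.1) $$\Pr\Big\{\sup_{f\in\mathcal F,\ \mathcal E_P(f)>r}\Big|\frac{\mathcal E_{P_n}(f)}{\mathcal E_P(f)}-1\Big| > q\gamma_n(r,s)\Big\} \le K\frac{q}{q-1}\,\frac1s\,e^{-s/(Kq)}.$$

Moreover, let $f_n\in\mathcal F$ be a data-dependent function such that

$$\mathcal E_{P_n}(f_n) \le (1-q\gamma_n(r;s))r.$$

Then

(7.2) $$\Pr\{\mathcal E_P(f_n) > r\} \le K\frac{q}{q-1}\,\frac1s\,e^{-s/(Kq)}.$$

In particular, (7.2) holds for $f_n = \hat f_n$.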

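Proof. As before, denote

$$\rho_j := rq^j,\qquad j = 1,\dots,l,$$

with $l$ being the smallest natural number such that $\rho_l \ge 1$. Let

$$\mathcal F_j := \{f-g : f,g\in\mathcal F(\rho_j)\}.$$

The key ingredient of the proof is the following inequality:

(7.3) $$\Pr\Big\{\max_{1\le j\le l}\frac{\|P_n-P\|_{\mathcal F_j}}{\rho_j} > \gamma_n(r,s)\Big\} \le K\frac{q}{q-1}\,\frac1s\,e^{-s/(Kq)}.$$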



Its proof is a straightforward modification of the proof of (2.4a) of Theorem 2.1
with further bounding as in (2.11) of Proposition 2.8 (but taking
$s_j = sq^j$), so we skip the details of the derivation. The only difference is that
the bound on

$$V_n(\rho_j) := \frac1n E\sup_{f\in\mathcal F_j}\sum_{i=1}^n (f(X_i)-Pf)^2$$

now involves the diameter of the set $\mathcal F(\rho_j)$:

$$V_n(\rho_j) \le D^2(\rho_j) + 16\phi_n(\rho_j).$$

Now on the event



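$$E := \Big\{\max_{1\le j\le l}\frac{\|P_n-P\|_{\mathcal F_j}}{\rho_j} \le \gamma_n(r,s)\Big\}$$

we have for all $1\le j\le l$ the following implication:

$$f\in\mathcal F(\rho_j)\setminus\mathcal F(\rho_{j-1}) \;\Longrightarrow\; \forall\sigma\in(0,\rho_j)\ \ \forall g\in\mathcal F(\sigma)$$

$$\begin{aligned}
\mathcal E_P(f) &\le P(f-g)+\sigma\\
&\le P_n(f-g)+\sigma+\|P_n-P\|_{\mathcal F_j}\\
&\le \mathcal E_{P_n}(f)+\sigma+\rho_j\gamma_n(r,s)\\
&\le \mathcal E_{P_n}(f)+\sigma+q\mathcal E_P(f)\gamma_n(r,s).
\end{aligned}$$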

Since $\sigma > 0$ is arbitrary, this implies that on the event $E$ for all $f\in\mathcal F$ with
$\mathcal E_P(f) > r$,

$$\frac{\mathcal E_{P_n}(f)}{\mathcal E_P(f)} \ge 1 - q\gamma_n(r,s).$$



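Since $\mathcal E_{P_n}(\hat f_n) = 0$, under the condition $1 - q\gamma_n(r,s) > 0$, we must have on
the event $E$ that $\mathcal E_P(\hat f_n) \le r$. Therefore, we have on the event $E$ the following
implication:

$$f\in\mathcal F(\rho_j)\setminus\mathcal F(\rho_{j-1}) \;\Longrightarrow$$

$$\begin{aligned}
\mathcal E_{P_n}(f) &= P_nf - P_n\hat f_n \le Pf - P\hat f_n + \|P_n-P\|_{\mathcal F_j}\\
&\le \mathcal E_P(f) + \rho_j\gamma_n(r,s) \le \mathcal E_P(f)(1+q\gamma_n(r,s)),
\end{aligned}$$

which means that on the event $E$ for all $f\in\mathcal F$ with $\mathcal E_P(f) > r$

$$\frac{\mathcal E_{P_n}(f)}{\mathcal E_P(f)} \le 1 + q\gamma_n(r,s).$$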



Since by (7.3)

$$\Pr(E^c) \le K\frac{q}{q-1}\,\frac1s\,e^{-s/(Kq)},$$

inequality (7.1) now follows. Inequality (7.2) is an obvious consequence
of (7.1) since the assumptions

$$\mathcal E_{P_n}(f_n) \le (1-q\gamma_n(r;s))r$$

and $\mathcal E_P(f_n) > r$ lead to

$$\frac{\mathcal E_{P_n}(f_n)}{\mathcal E_P(f_n)} < 1 - q\gamma_n(r,s). \qquad □$$

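If we define

$$\omega_n(\delta) := E\sup_{f,g\in\mathcal F,\ \rho_P(f,g)\le\delta}|(P_n-P)(f-g)|,$$

then

$$\phi_n(\delta) \le \omega_n(D(\delta)).$$

As a result, under the assumptions

$$\omega_n(\delta) \le Cn^{-1/2}\delta^{1-\rho}$$

and

$$D(\delta) \le C\delta^{1/(2\kappa)}$$

with some $C > 0$, $\rho\in(0,1)$, $\kappa\ge1$, Theorem 7.1 gives a convergence rate of
$\mathcal E_P(\hat f_n)$ to 0 of the order

$$O(n^{-\kappa/(2\kappa+\rho-1)}),$$

a very typical rate in regression and classification problems.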
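The order of this rate can be recovered by a short balancing computation. The sketch below assumes that the two conditions take the power form $\omega_n(\delta)\le Cn^{-1/2}\delta^{1-\rho}$ and $D(\delta)\le C\delta^{1/(2\kappa)}$ (an assumption of this sketch), writes $\rho'$ for the running variable in the supremum to avoid a clash with the exponent $\rho$, and tracks only the term $\beta_n(r)$ of $\gamma_n(r,s)$; the constants $C'$ and $c$ are generic and not optimized.

```latex
% Assumed forms: \omega_n(\delta) \le C n^{-1/2}\delta^{1-\rho},
%                D(\delta) \le C \delta^{1/(2\kappa)}.
\begin{align*}
\phi_n(\delta) &\le \omega_n(D(\delta)) \le C' n^{-1/2}\,\delta^{(1-\rho)/(2\kappa)},\\
\beta_n(r) &= \sup_{\rho'\ge r}\frac{\phi_n(\rho')}{\rho'}
  \le C' n^{-1/2}\, r^{(1-\rho)/(2\kappa)-1}
  \qquad\text{since } \tfrac{1-\rho}{2\kappa} < 1,\\
C' n^{-1/2}\, r^{(1-\rho)/(2\kappa)-1} \le c
  &\iff r^{(2\kappa+\rho-1)/(2\kappa)} \ge \tfrac{C'}{c}\, n^{-1/2}
  \iff r \ge \Bigl(\tfrac{C'}{c}\Bigr)^{2\kappa/(2\kappa+\rho-1)} n^{-\kappa/(2\kappa+\rho-1)}.
\end{align*}
```

For $\kappa = 1$ the threshold is of the order $n^{-1/(1+\rho)}$.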

7.1. Regression. For simplicity and in order to directly use the above
bounds, we consider only regression models with bounded response. Let
$(X,Y)$ be a random couple taking values in $S\times[0,1]$. The regression function

$$g_0(x) := E(Y|X=x),\qquad x\in S,$$

takes its values in $[0,1]$ and minimizes the functional $g\mapsto E(Y-g(X))^2$. The
problem of estimating $g_0$ becomes a risk minimization problem $Pf\to\min$
if one defines $P$ as the distribution of $(X,Y)$ and relates to each $g$ on $S$ the
function $f$ on $S\times[0,1]$ as follows:

$$f(x,y) := f_g(x,y) := (y-g(x))^2,\qquad (x,y)\in S\times[0,1].$$

Given a class $\mathcal G$ of measurable functions from $S$ into $[0,1]$ and a sample
$(X_1,Y_1),\dots,(X_n,Y_n)$ of i.i.d. copies of $(X,Y)$, one can define a least-squares
estimate of $g_0$ as a solution $\hat g_n$ of the following minimization problem:

$$n^{-1}\sum_{j=1}^n (Y_j-g(X_j))^2 \longrightarrow \min,\qquad g\in\mathcal G,$$

which is equivalent to minimizing $P_nf$ over the class $\mathcal F := \{f_g : g\in\mathcal G\}$, $P_n$
being the empirical measure based on the sample $(X_1,Y_1),\dots,(X_n,Y_n)$. This
will allow us to use the bounds of Theorem 7.1. First suppose that $g_0\in\mathcal G$.
Then, by a simple and direct computation,

$$\mathcal E_P(f_g) = E(Y-g(X))^2 - E(Y-g_0(X))^2 = \|g-g_0\|^2_{L_2(\Pi)},$$

where $\Pi$ is the distribution of $X$. Therefore,

$$\mathcal F(\delta) = \{f\in\mathcal F : \mathcal E_P(f)\le\delta\} = \{f_g : \|g-g_0\|^2_{L_2(\Pi)}\le\delta\}.$$

Also, if $g_1,g_2\in\mathcal G$, then

$$\begin{aligned}
P(f_{g_1}-f_{g_2})^2 &= E((Y-g_1(X))^2-(Y-g_2(X))^2)^2\\
&= E(g_1(X)-g_2(X))^2(2Y-g_1(X)-g_2(X))^2\\
&\le 4\|g_1-g_2\|^2_{L_2(\Pi)} =: \rho_P^2(f_{g_1},f_{g_2}),
\end{aligned}$$

since $Y,g_1(X),g_2(X)\in[0,1]$. It immediately follows that the $\rho_P$-diameter
of $\mathcal F(\delta)$ satisfies the following bound: $D(\delta)\le4\sqrt\delta$ and as a result we have
$\Delta(r)\le16$.
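The two computations above can be checked exactly on a small discrete model. The sketch below is only an illustration under assumed data: $X$ is uniform on three points, $Y$ is binary with conditional mean `eta` (so that $g_0 = \eta$), and `g1`, `g2` are arbitrary functions with values in $[0,1]$; none of these values or function names come from the paper.

```python
from fractions import Fraction as Fr

# Hypothetical model: X uniform on {0, 1, 2}; Y in {0, 1} with P(Y=1 | X=x) = eta[x].
Pi = [Fr(1, 3), Fr(1, 3), Fr(1, 3)]
eta = [Fr(1, 5), Fr(1, 2), Fr(9, 10)]     # regression function g_0 = eta

def risk(g):
    """P f_g = E (Y - g(X))^2, computed exactly from the joint distribution."""
    return sum(
        Pi[x] * (eta[x] * (1 - g[x]) ** 2 + (1 - eta[x]) * (0 - g[x]) ** 2)
        for x in range(len(Pi))
    )

def sq_dist(g1, g2):
    """Squared L_2(Pi) distance between two functions on {0, 1, 2}."""
    return sum(Pi[x] * (g1[x] - g2[x]) ** 2 for x in range(len(Pi)))

def second_moment_of_loss_difference(g1, g2):
    """P (f_{g1} - f_{g2})^2 = E ((Y - g1(X))^2 - (Y - g2(X))^2)^2."""
    total = Fr(0)
    for x in range(len(Pi)):
        for y, py in ((1, eta[x]), (0, 1 - eta[x])):
            diff = (y - g1[x]) ** 2 - (y - g2[x]) ** 2
            total += Pi[x] * py * diff ** 2
    return total

g0 = eta
g1 = [Fr(0), Fr(1, 2), Fr(1)]
g2 = [Fr(1, 2), Fr(1, 2), Fr(1, 2)]

excess_g1 = risk(g1) - risk(g0)
print(excess_g1 == sq_dist(g1, g0))                                   # excess risk identity
print(second_moment_of_loss_difference(g1, g2) <= 4 * sq_dist(g1, g2))  # factor-4 bound
```

Exact rational arithmetic is used so that the identity for the excess risk can be compared with `==` rather than up to rounding.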

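Next, the usual symmetrization inequality gives

$$\begin{aligned}
\phi_n(\delta) &= E\sup_{g_1,g_2\in\mathcal G,\ \|g_1-g_0\|^2_{L_2(\Pi)}\le\delta,\ \|g_2-g_0\|^2_{L_2(\Pi)}\le\delta}|(P_n-P)(f_{g_1}-f_{g_2})|\\
&\le 4E\sup_{g\in\mathcal G,\ \Pi(g-g_0)^2\le\delta}\Big|n^{-1}\sum_{i=1}^n\varepsilon_i((Y_i-g(X_i))^2-(Y_i-g_0(X_i))^2)\Big|
\end{aligned}$$

and, using a Rademacher comparison inequality (e.g., [29], Theorem 4.12),
this can be bounded further by

$$8E\sup_{g\in\mathcal G,\ \Pi(g-g_0)^2\le\delta}\Big|n^{-1}\sum_{i=1}^n\varepsilon_i(g(X_i)-g_0(X_i))\Big| =: \psi_n(\delta).$$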



The inequality of Theorem 4.12 in [29] is used as follows: for fixed $X_i,Y_i$,
define $A_i := Y_i - g_0(X_i)$, $\varphi_i(u) := (A_i-u)^2 - A_i^2$ and, using the fact that
$\varphi_i$ are Lipschitz functions on $[0,1]$, upper bound

$$E_\varepsilon\sup_{g\in\mathcal G,\ \Pi(g-g_0)^2\le\delta}\Big|\sum_{i=1}^n\varepsilon_i\varphi_i(g(X_i)-g_0(X_i))\Big|.$$

Define now

$$\beta_n(r) := \sup_{\rho\ge r}\frac{\psi_n(\rho)}{\rho}$$



and 



%ir, s) := P^r) + 8,/^(l + V ^ ^ 2) ' 
Theorem 7.1 immediately implies that as soon as $g_0\in\mathcal G$ and $q\gamma_n(r,s)<1$
we have

(7.4) $$\Pr\{\|\hat g_n-g_0\|^2_{L_2(\Pi)} > r\} \le K\frac{q}{q-1}\,\frac1s\,e^{-s/(Kq)}.$$



Clearly, a similar bound holds for approximate least-squares estimates (as
in Theorem 7.1). It is also possible and easy to handle the case $g_0\notin\mathcal G$ and
to bound in this case $\|\hat g_n-g_0\|^2_{L_2(\Pi)}$
with high probability, but we do not give this type of bound here (see, e.g.,
[26]). We conclude this brief discussion of regression problems with a couple
of specific examples where the expectation bounds are used to derive the
value of $r$ in (7.4).

Example 7.2. Let $\mathcal G$ be a VC-subgraph class of measurable functions
from $S$ into $[0,1]$. Let $F_\delta : S\to[0,1]$ be a measurable envelope of the class
$\{g-g_0 : \Pi(g-g_0)^2\le\delta\}$. Denote

$$\tau(\delta) := \frac{\|F_\delta\|_{L_2(\Pi)}}{\sqrt\delta}.$$

Applying Theorem 3.1 to VC-subgraph classes gives



K 



MS) < ^j6logT{5), 



assuming '"^J*-*^^ > n. Therefore, we have (under a natural assumption that 



<5 

the function 6 i-^ jg nonincreasing 



/n V r 

for r larger than or equal to the solution r„ of the equation 

log r(r) 



n. 



Then for $r = C(r_n+\frac sn)$ with large enough $C$ and for $q = 2$, we have
$q\gamma_n(r,s)<1$ and the following bound holds for the least-squares estimate $\hat g_n$:

$$\Pr\Big\{\|\hat g_n-g_0\|^2_{L_2(\Pi)} > C\Big(r_n+\frac sn\Big)\Big\} \le K\,\frac1s\,e^{-s/K}.$$

Since $\tau(\delta)\le\frac{1}{\sqrt\delta}$, this always gives the convergence rate at least as good
as $O(\frac{\log n}{n})$ for least-squares estimators picked from VC-subgraph classes.
However, if $\tau(\delta)$ is smaller, one can get an improvement on the logarithmic
factor. In particular, if $\mathcal G$ is a subset of a finite-dimensional space of functions
on $S$ of dimension $d$, then one can find an orthonormal system of functions
$e_1,\dots,e_d$ in $L_2(\Pi)$ such that $\mathcal G\subset\mathrm{l.s.}(e_1,\dots,e_d)$. Then, writing
$g-g_0=\sum_{j=1}^d\alpha_je_j$, we have

$$\sup_{g\in\mathcal G,\ \Pi(g-g_0)^2\le\delta}|g-g_0|(x) \le \sup_{\sum_{j=1}^d\alpha_j^2\le\delta}\Big|\sum_{j=1}^d\alpha_je_j(x)\Big| \le \sqrt\delta\Big(\sum_{j=1}^de_j^2(x)\Big)^{1/2}.$$

If we set

$$F_\delta(x) := \sqrt\delta\Big(\sum_{j=1}^de_j^2(x)\Big)^{1/2},$$

this implies $\|F_\delta\|_{L_2(\Pi)}\le\sqrt{d\delta}$ and as a result $\tau(\delta)\le\sqrt d$, which gives the correct
convergence rate $O(n^{-1})$.
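The envelope computation in the finite-dimensional case can be illustrated numerically. The sketch below assumes a hypothetical measure $\Pi$ uniform on four points and three $\pm1$-valued functions that are orthonormal in $L_2(\Pi)$; it also assumes that the quantity compared with $\sqrt d$ is the envelope norm divided by $\sqrt\delta$. All names and values are illustrative and not taken from the paper.

```python
import math

# Hypothetical example: Pi uniform on 4 points; e_1, e_2, e_3 are +-1 valued
# (Walsh-type) functions, which are orthonormal in L_2(Pi).
Pi = [0.25, 0.25, 0.25, 0.25]
E = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
]
d = len(E)

def inner(u, v):
    """Inner product in L_2(Pi)."""
    return sum(p * a * b for p, a, b in zip(Pi, u, v))

def envelope(delta):
    """F_delta(x) = sqrt(delta) * (sum_j e_j(x)^2)^(1/2)."""
    return [math.sqrt(delta) * math.sqrt(sum(e[x] ** 2 for e in E))
            for x in range(len(Pi))]

delta = 0.04
F_delta = envelope(delta)
norm_F = math.sqrt(inner(F_delta, F_delta))    # ||F_delta|| in L_2(Pi)
tau = norm_F / math.sqrt(delta)                # normalized envelope norm

# Any h = sum_j a_j e_j with sum_j a_j^2 <= delta is dominated pointwise by F_delta.
a = [0.1, -0.1, 0.1]                           # sum of squares is 0.03 <= delta
h = [sum(a[j] * E[j][x] for j in range(d)) for x in range(len(Pi))]
dominated = all(abs(hx) <= fx for hx, fx in zip(h, F_delta))
print(norm_F, tau, dominated)
```

In this example the normalized envelope norm equals $\sqrt d$ exactly, because each $e_j^2$ integrates to 1 under $\Pi$.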

Example 7.3. Let $\mathcal G$ denote the set of all monotone step functions from
$[0,1]$ into itself with a finite number of jumps. For a fixed $g_0\in\mathcal G$, say with $m$
jumps, the class $\{g-g_0 : g\in\mathcal G\}$ is VC-major ($g_0$ defines a partition of $[0,1]$
into $m$ intervals; on each of these intervals $g-g_0$ is monotone and hence
$\{\{g-g_0\ge t\} : g\in\mathcal G,\ t\in\mathbb R\}$ is a VC class with VC dimension depending on
$m$). Arguing as in Example 3.8, we can show that

V;„(<5)<— V^(log-j Moglog-' 



n 



V 



K( l\3/2 1 ^Vlog^ 

— I log - I log log - V , 



n 



5 J 



n 



which implies 



/3n(r) < 



K 



K 



nr 



1x3/2 



log-) flog log ^"^ 



V — ( log - J log log - V 



1 v^Iogn 



nr 



Let us take $q = 2$. Then it is easy to conclude that $q\gamma_n(r,s)<1$ as soon as

$$r \ge C\,\frac{s+(\log n)^{3/2}\log\log n}{n}$$

with sufficiently large constant $C$ (which will depend on the number of jumps
of $g_0$). Hence, if we take an estimate $\hat g_n$ such that

n n 



« ^ E(^i - ~9n{X,)Y < inf n"i 5^(y, - giXj)y 

(logn)'^/^ log logn 



+ 



2n 



then Theorem 7.1 implies that 

Pr||bn-5o|lL2(n) >'^(5o) / - ' 

with some constants $C(g_0)$ and $K$. In particular, the bound implies that

$$E\|\hat g_n-g_0\|^2_{L_2(\Pi)} = O\Big(\frac{(\log n)^{3/2}\log\log n}{n}\Big).$$

Since the constant $C(g_0)$ tends to infinity as the number of jumps of the
function $g_0$ tends to infinity, the above bound cannot be made uniform in
$g_0\in\mathcal G$ (and, in fact, the convergence rate of $\sup_{g_0\in\mathcal G}E\|\hat g_n-g_0\|^2_{L_2(\Pi)}$ is
much slower for any estimator $\hat g_n$). Results of this type (in a slightly different
context and with an improvement on the logarithmic factors) can be found,
for instance, in [45] and references therein.

7.2. Classification. In classification problems, one deals with random
couples $(X,Y)$ in $S\times\{0,1\}$, $X$ being an observable instance and $Y$ an
unobservable binary label assigned to this instance. Functions $g$ from $S$ into
$\{0,1\}$ are called classifiers. The generalization error of a classifier $g$ is defined
as

$$\Pr\{Y\ne g(X)\} = P\{(x,y) : y\ne g(x)\},$$

where $P$ is the joint distribution of $(X,Y)$. It is well known that the minimal
possible generalization error (the Bayes risk) is attained at a classifier

$$g_0(x) := I(\eta(x)\ge1/2),$$

where $\eta(x) := E(Y|X=x)$ is the regression function. Since the distribution
$P$ of $(X,Y)$ and hence also the regression function $\eta$ are unknown, a
reasonable approach to classification is to minimize the training error

$$n^{-1}\sum_{j=1}^n I(Y_j\ne g(X_j)) = P_n\{(x,y) : y\ne g(x)\},$$

based on i.i.d. training examples sampled from $P$, over a suitable class $\mathcal G$ of
classifiers. For simplicity, we assume that $g_0\in\mathcal G$. Denote by $\hat g_n$ a minimizer of
the training error over the class $\mathcal G$. Thus, the classification problem becomes
a version of empirical risk minimization and one can use Theorem 7.1 to
study the size of the excess risk

$$\mathcal E(\hat g_n) := P\{(x,y) : y\ne\hat g_n(x)\} - P\{(x,y) : y\ne g_0(x)\}.$$
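A small exact computation may help fix these notions. The sketch below assumes a hypothetical three-point model with a made-up regression function `eta` and a made-up training sample `data`; it computes the Bayes classifier, a training error minimizer over all classifiers on the three points, and its excess risk. The last line checks a margin-type lower bound on the excess risk of the kind used for well-separated classes; the constant 2 in that check comes from the standard identity that the excess risk equals the $\Pi$-integral of $|2\eta-1|$ over the disagreement set, and is an assumption of this sketch rather than a constant taken from the text.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical model: X uniform on {0, 1, 2}, eta[x] = P(Y = 1 | X = x).
Pi = [Fr(1, 3), Fr(1, 3), Fr(1, 3)]
eta = [Fr(1, 10), Fr(7, 10), Fr(9, 10)]

def generalization_error(g):
    """P{(x, y): y != g(x)} for a classifier g given as a tuple of 0/1 values."""
    return sum(Pi[x] * (eta[x] if g[x] == 0 else 1 - eta[x]) for x in range(len(Pi)))

def training_error(g, data):
    """P_n{(x, y): y != g(x)} on a list of (x, y) pairs."""
    return Fr(sum(1 for x, y in data if y != g[x]), len(data))

# Bayes classifier g_0(x) = I(eta(x) >= 1/2) and the class of all classifiers on S.
g0 = tuple(1 if e >= Fr(1, 2) else 0 for e in eta)
G = list(product((0, 1), repeat=len(Pi)))

# A fixed, made-up training sample of (x, y) pairs.
data = [(0, 0), (0, 0), (1, 1), (1, 0), (1, 0), (2, 1), (2, 1), (0, 1)]
g_hat = min(G, key=lambda g: training_error(g, data))    # training error minimizer

excess = generalization_error(g_hat) - generalization_error(g0)
h = min(abs(e - Fr(1, 2)) for e in eta)                  # margin: |eta - 1/2| >= h
disagreement = sum(Pi[x] for x in range(len(Pi)) if g_hat[x] != g0[x])
print(g0, g_hat, excess, excess >= 2 * h * disagreement)
```

Here the training sample misleads the minimizer at one point, so the excess risk is positive and equals the lower bound $2h\,\Pi\{\hat g_n\ne g_0\}$ exactly, since the only point of disagreement is the one where the margin is smallest.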

As before, $\Pi$ denotes the distribution of $X$. If, with some $\kappa\ge1$ and $c>0$,
for all $g\in\mathcal G$

$$P\{(x,y) : y\ne g(x)\} - P\{(x,y) : y\ne g_0(x)\} \ge c\,\Pi^\kappa\{x : g(x)\ne g_0(x)\},$$

then the diameter $D(\delta)\le C\delta^{1/(2\kappa)}$. This holds, for instance, if for all $t>0$

$$\Pi\{x : 0<|\eta(x)-1/2|\le t\} \le Ct^\alpha$$

and in this case $\kappa = \frac{1+\alpha}{\alpha}$ [44]. Under the standard condition that the
$\varepsilon$-entropy of the class $\mathcal G$ grows as $O(\varepsilon^{-2\rho})$ [with several possible kinds of
entropy involved and with $\rho\in(0,1)$] Theorem 7.1 yields a bound on the excess
risk of the order $O(n^{-\kappa/(2\kappa+\rho-1)})$ as in [44]. The main difference with the
$L_2$-regression problem where $\kappa=1$ is that in classification $\kappa$ can take any value
greater than or equal to 1 leading to the whole spectrum of convergence
rates. If there exists $h>0$ such that

$$\forall x\in S\qquad |\eta(x)-1/2|\ge h,$$

then it is easy to see that

$$P\{(x,y) : y\ne g(x)\} - P\{(x,y) : y\ne g_0(x)\} \ge ch\,\Pi\{x : g(x)\ne g_0(x)\},$$

so we do have $\kappa=1$. This case of well-separated classes was looked at in the
recent paper of Massart and Nedelec [34]. Define $f_g(x,y) := I(y\ne g(x))$ and
$\mathcal F := \{f_g : g\in\mathcal G\}$. We are using the distance

$$\rho_P(f_{g_1},f_{g_2}) := \Pi^{1/2}|g_1-g_2|^2.$$
Then we have the following bound for the diameter D{6): 

implying 

A(r)<-^. 

Suppose that $\mathcal C := \{\{g=1\} : g\in\mathcal G\}$ is a VC class and $C_0 := \{g_0=1\}$. Define
a local version of Alexander's capacity function:

Then



and as a result 



To satisfy the condition $q\gamma_n(r,s)<1$ (say, with $q=2$) it is enough to take



C 



hip 



V 



s 
nh 



where ip denotes the inverse of the function 

log r(r) 



and C is a sufficiently large constant. Now it is easy to check that 



fnh?\ 1/ , f V 



yielding the following bound: 
Pr(£:(5n) > C 



V , 

— logr 
nn 



V 



n. 



n 



s 

nh 



If we replace T(r) by the trivial upper bound this gives one of the results 
of Massart and Nedelec [34]: the excess risk is bounded by 

nh'^ 



V 

K—\oi 
nh 



V 



In the case of smaller r, it is a slight improvement of their bound. It is easy 
to see that for some classes of sets and probability measures P r can be 
even bounded, leading to the bound on excess risk of the order O(^). For 
instance, as in Section 4, suppose that $S=[0,1]^d$ for some $d\ge1$ and $P$ has
a density that is uniformly bounded and bounded away from 0 on $S$. As
before, $h$ is the Hausdorff distance between subsets of $S$. Let $\mathcal C$ be a VC
class of convex subsets of $S$ and $C_0\in\mathcal C$, $P(C_0)>0$. Suppose that for some
$K>0$

$$K^{-1}h(C,C_0) \le P(C\triangle C_0) \le Kh(C,C_0),\qquad C\in\mathcal C.$$

Recall that the upper bound always holds for convex sets ([17], pages 269-270),
but the lower bound holds only for special classes of sets (balls, rectangles,
etc.). Then the function $\tau$ is uniformly bounded. The proof easily
follows from the same type of argument as in Section 4 (before Example 4.9).